Yes, think of the starving orphan records…

Ours is an eCommerce system, selling mostly shoes.  We have three levels of 
nested objects representing what we sell:
- Product: Mostly title and description.
- Item: A specific color and some other attributes, including price. A Product 
has one or more Items; each Item belongs to one Product.
- SKU: A specific size and SKU ID. An Item has one or more SKUs; each SKU 
belongs to one Item.
[PRODUCT  [ITEM  [SKU] [SKU] [SKU]] [ITEM [SKU]] ]

Products, Items, and SKUs all have ID numbers. One Product will never have the 
same ID as another Product, but it’s possible for a Product to have the same ID 
as an Item or a SKU. And that is the problem.  So the program that creates the 
import file adds a new field called uuid, which is a P, I, or S (for Product, 
Item, or SKU) followed by the ID.  We did it this way because my understanding 
is that Solr can’t use a compound unique key.  The uuid is unique across all 
documents, not just all documents of the same docType.
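
For what it's worth, here is a rough Python sketch of how the uuid is put 
together and how uniqueness can be checked across the nested documents. This is 
not our actual generator: the names PREFIXES, make_uuid, walk, and 
duplicate_uuids are made up for illustration, and it assumes the import data is 
a list of Product objects with children under _childDocuments_.

```python
from collections import Counter

# Hypothetical mapping from docType to the one-letter prefix described above.
PREFIXES = {"Product": "P", "Item": "I", "Sku": "S"}


def make_uuid(doc_type, doc_id):
    """Build the compound key: prefix letter for the docType, then the ID."""
    return f"{PREFIXES[doc_type]}{doc_id}"


def walk(doc):
    """Yield a document and all of its nested _childDocuments_, depth first."""
    yield doc
    for child in doc.get("_childDocuments_", []):
        yield from walk(child)


def duplicate_uuids(products):
    """Return the uuids that occur more than once anywhere in the import data."""
    counts = Counter(d["uuid"] for p in products for d in walk(p))
    return sorted(u for u, n in counts.items() if n > 1)
```

Running duplicate_uuids over the full import file plus the test file would 
list exactly the uuids I expected Solr to reject or overwrite.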

To test whether Solr would complain when I inserted a document whose uuid was 
not unique, I grabbed the first few products from the full import file and 
changed their IDs so they no longer duplicate the real data, but left the 
uuids alone, so they still duplicate the real data, which was already loaded.

My expectation was that loading the data would produce an error saying the 
uuid was already in use.  YOUR expectation is that the existing record would be 
overwritten.  What actually happened is that the new documents were added 
alongside the old ones, duplicate uuids and all, which is the worst possible 
case.  This is why I think Solr is not respecting the uniqueKey setting in my 
schema.xml.
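
One way to see which uniqueKey the core is really using, rather than what my 
schema.xml says, would be to ask Solr's Schema API. A minimal Python sketch, 
assuming Solr is reachable at http://localhost:8983/solr and the core is named 
catalog (both placeholders, as are the function names):

```python
import json
from urllib.request import urlopen


def unique_key_url(base_url, core):
    """URL of the Schema API endpoint that reports the effective uniqueKey."""
    return f"{base_url.rstrip('/')}/{core}/schema/uniquekey"


def parse_unique_key(body):
    """Pull the uniqueKey field name out of the JSON response body."""
    return json.loads(body)["uniqueKey"]


def effective_unique_key(base_url, core):
    """Ask a running Solr which field it treats as the uniqueKey."""
    with urlopen(unique_key_url(base_url, core)) as resp:
        return parse_unique_key(resp.read().decode("utf-8"))


# Example call (host and core name are placeholders):
# effective_unique_key("http://localhost:8983/solr", "catalog")
```

If that reports id rather than uuid, then the schema Solr has loaded is not the 
schema.xml I edited.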

Does that make more sense?  I hope you can help me understand this discrepancy. 
Thanks for your efforts so far.

On 2/2/17, 3:13 PM, "Mikhail Khludnev" <m...@apache.org> wrote:

    David,
    I don't fully follow how the IDs are assigned, but beware that repeating a
    uniqueKey value deletes the former occurrence. In a block join index this
    corrupts the block structure: a parent can't be deleted and its children
    left as orphans (.. so touching, I'm sorry). Just make sure that the number
    of deleted docs is 0 at first.
    
    On Thu, Feb 2, 2017 at 6:20 PM, David Kramer <david.kra...@shoebuy.com>
    wrote:
    
    > Thanks for responding, Mikhail.  There are no deleted documents.  Since
    > I’m fairly new to Solr, one of the things I’ve been paranoid about is that
    > I have no way of validating my schema.xml, or of knowing whether Solr is
    > even using it (I have evidence it’s not, more below). So for each test,
    > I’ve wiped out the index, recreated it, and reimported.
    >
    > Back to whether my schema.xml is being used, I mentioned that I had to
    > come up with a compound UUID field of the first character of the docType
    > plus the ID, and we put “<uniqueKey>uuid</uniqueKey>” (was id) in our
    > schema.xml.  Then I deleted and recreated the index and restarted Solr.
    > In order to verify it was working, I created an import file that had unique
    > IDs but UUIDs which were duplicates of existing records, and it imported
    > the new records even though the UUIDs existed in the database already.
    > I’m not sure if Solr should have produced an error or not. I’ll research
    > that, but I mention that here in case it’s relevant.
    >
    > Thanks.
    >
    > On 2/2/17, 6:10 AM, "Mikhail Khludnev" <m...@apache.org> wrote:
    >
    >     David,
    >
    >     Can you make sure your index doesn't have deleted docs? This can be
    >     seen in Solr Admin.
    >     And can you merge the index to avoid having them in the index?
    >
    >     On Thu, Feb 2, 2017 at 12:29 AM, David Kramer <david.kra...@shoebuy.com>
    >     wrote:
    >
    >     >
    >     >
    >     > Some background:
    >     > - The data involved is catalog data, with three nested objects:
    >     >   Products, Items, and Skus, in that order. We have a docType field
    >     >   on each record as a differentiator.
    >     > - The "id" field in our data is unique within datatype, but not across
    >     >   datatypes. We added a "uuid" field in our program that generates the
    >     >   Solr import file that is the id prefixed by the first letter of the
    >     >   docType, like P12345. That makes the uuid field unique, and we have
    >     >   that as the uniqueKey in our schema.xml.
    >     > - We are trying to retrieve the parent Product, and all children
    >     >   documents. As such, we are using the ChildDocTransformerFactory
    >     >   ([child...]) to retrieve the children along with the parent. We have
    >     >   not yet solved the problem of getting SKUs within Items as nested
    >     >   documents in the results, and we will have to figure that out at some
    >     >   point, but for now we get them flattened.
    >     > - We are building out the proof of concept for this. This is all new
    >     >   work, so we are free to change a lot.
    >     > - This is Solr 6.0.0, and we are importing in JSON format, if that
    >     >   matters.
    >     > - I submitted this question to StackOverflow
    >     >   <http://stackoverflow.com/questions/41969353/solr-querying-nested-documents-with-childdoctransformerfactory-get-parent-quer>
    >     >   but haven’t gotten any answers yet.
    >     >
    >     >
    >     > Our data looks like this (I've removed some fields for simplicity):
    >     >
    >     > {
    >     >   "id": 739063,
    >     >   "docType": "Product",
    >     >   "uuid": "P739063",
    >     >   "_childDocuments_": [
    >     >     {
    >     >       "id": 1537378,
    >     >       "price": 25.45,
    >     >       "color": "Blush",
    >     >       "docType": "Item",
    >     >       "productId": 739063,
    >     >       "uuid": "I1537378",
    >     >       "_childDocuments_": [
    >     >         {
    >     >           "id": 12799578,
    >     >           "size": "10",
    >     >           "width": "W",
    >     >           "docType": "Sku",
    >     >           "itemId": 1537378,
    >     >           "uuid": "S12799578"
    >     >         }
    >     >       ]
    >     >     }
    >     >   ]
    >     > }
    >     >
    >     >
    >     >
    >     > The query to fetch all Products and their children nested inside them
    >     > is q=docType:Product&fl=title,id,docType,[child
    >     > parentFilter=docType:Product]. When I run that query, all is well, and
    >     > it returns the first 10 rows. However, if I fetch more rows by adding,
    >     > say, &rows=500, we get the error "Parent query yields document which is
    >     > not matched by parents filter, docID=XXX".
    >     >
    >     > When we first saw that error, we discovered our id field was not unique
    >     > across document types, so we added the uuid field as mentioned above,
    >     > which is. We also set it as the uniqueKey in our schema.xml file, wiped
    >     > the core, recreated it, and restarted Solr just to make sure it was in
    >     > effect. We have double checked and are sure that the uuid fields are
    >     > unique.
    >     >
    >     >
    >     >
    >     > In all the search results for that error that I've found, the OP did
    >     > not have a field that could differentiate the different document types,
    >     > but as you see we do. Since both the query and the parentFilter are
    >     > searching for docType:Product I don't see how either could possibly
    >     > return anything but parents. We've also tried adding
    >     > childFilter=docType:Item and childFilter=docType:Sku but that did not
    >     > help.  I also tried using title:* for the filter since only products
    >     > have titles.
    >     >
    >     >
    >     >
    >     > Is there anything else we can try?
    >     >
    >     > Any explanation of this?
    >     >
    >     > Is it possible that it's not using uuid as the unique identifier even
    >     > though it's specified in the schema.xml, and would that even cause
    >     > this?
    >     >
    >     > Thanks.
    >     >
    >     >
    >     >
    >
    >
    >     --
    >     Sincerely yours
    >     Mikhail Khludnev
    >
    >
    >
    
    
    -- 
    Sincerely yours
    Mikhail Khludnev
    
