Re: Is payload the right solution for my problem?
I think I just found the solution. Would the right strategy be to store the original XML content and then apply a solr.HTMLStripCharFilterFactory when querying? I ran a quick test and it works; the only problem now is that it also finds the data contained in the XML attributes. I think I will put my data into two fields, one containing only the raw text without XML and one in the original format. Then I search in the raw field but return the original format with the response. The only drawback I see is that this needs double the disk space. Is there a better solution? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-payload-the-right-solution-for-my-problem-tp4063814p4064117.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Is payload the right solution for my problem?
I did some experiments, but I think I will end up needing double the disk space. The problem is the following: I want to search in the full text (without the XML content), but I need to know the position of the search result both in the full text (for display) and in the XML data (to get the attributes associated with the matched term). I tried to solve this with highlighting, and as my experiments show, highlighting on both fields requires both to be indexed and stored, so I end up with nearly double the disk space of my original data. Does Solr provide any other options for such a problem?
Problem with PatternReplaceCharFilter
Hi, I have a problem with PatternReplaceCharFilter when indexing a field. I created the following field: --> And I created a field that is indexed and stored: I need to index a document with such a structure in this field: Basically I have some sort of XML structure. I only need to search in the "content" attribute, but when highlighting I need to get back to the enclosing XML tags. So with the three regexes I want to remove all unwanted tags and tokenize/index only the important data. I know that I could use HTMLStripCharFilterFactory, but then the tag names, attribute names and attribute values get indexed as well, and I don't want to search in that content.

I read the following in the doc: NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble. The thing is, why is this the case? When I run the analysis from the Solr admin UI, the CharFilters generate "the content to search in the second content line", which looks perfect, but then the StandardTokenizer gets the start and end offsets of the tokens wrong.

Is there another solution to my problem? Could I use the following method I saw in the doc of PatternReplaceCharFilter: protected int correct(int currentOff) Documentation: Retrieve the corrected offset. How could I solve such a task? -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869.html Sent from the Solr - User mailing list archive at Nabble.com.
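To illustrate why a length-changing replacement disturbs highlighting, here is a minimal Python sketch of the general idea behind an offset correction such as the correct(int currentOff) method quoted above. It is not Lucene's implementation. The element name tLine, the attribute layout, the regex and the helper names extract_with_offsets and correct_offset are assumptions made up for the illustration. The filter concatenates the content values and records, for each value, how far its characters have moved relative to the raw input; token offsets in the filtered text are only valid for highlighting the raw text after that difference is added back.

```python
import bisect
import re

# Hypothetical pattern: captures the value of every content="..." attribute.
CONTENT_VALUE = re.compile(r'content="([^"]*)"')


def extract_with_offsets(raw, sep=" "):
    """Concatenate all content values and remember where each came from.

    Returns (filtered, starts, diffs):
      starts[i] is the offset in the filtered text where value i begins,
      diffs[i]  is what must be added to a filtered offset inside value i
                to obtain the matching offset in the raw text.
    """
    parts, starts, diffs = [], [], []
    pos = 0
    for match in CONTENT_VALUE.finditer(raw):
        if parts:
            parts.append(sep)
            pos += len(sep)
        starts.append(pos)
        diffs.append(match.start(1) - pos)
        parts.append(match.group(1))
        pos += len(match.group(1))
    return "".join(parts), starts, diffs


def correct_offset(offset, starts, diffs):
    """Map an offset in the filtered text back to the raw text."""
    if not starts:
        return offset
    i = max(bisect.bisect_right(starts, offset) - 1, 0)
    return offset + diffs[i]


raw = ('<tLine aa="bb" cc="dd" content="the content to search in" ee="ff" />\n'
       '<tLine aa="xx" content="the second content line" ee="yy" />')
filtered, starts, diffs = extract_with_offsets(raw)

# Token offsets as a tokenizer would see them in the filtered text.
start = filtered.index("second")
end = start + len("second")

# Without correction the offsets point at unrelated raw characters;
# with correction they select the same word inside the raw XML.
uncorrected = raw[start:end]
corrected = raw[correct_offset(start, starts, diffs):
                correct_offset(end, starts, diffs)]
```

If the offsets reported for a token are not corrected in this way, a highlighter that inserts its markers into the stored raw value will place them at the wrong characters, which is one plausible reading of the note in the documentation.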
Re: Problem with PatternReplaceCharFilter
Honestly, I have no idea how to do that. PatternReplaceCharFilter doesn't seem to have a parameter like preservePositions="true", optionally combined with fillCharacter=" ". And I don't think I can express this simply as a regex. How would I compute, in a pure regex, the length difference before and after the match? The specific problem is that when highlighting, the term offsets are wrong and the result is not a valid XML structure that I can handle. I expect something like search in" ee="ff" /> but I can tLineaa="bb" cc="dd" content="the content to search in" ee="ff" /> Thanks for your help.
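For what it is worth, the behaviour wished for above (preservePositions and fillCharacter are hypothetical parameter names, not existing options) is easy to express outside of a pure regex by using a replacement routine instead of a replacement string. The following Python sketch is only an illustration of that idea, for example as a preprocessing step before indexing; the element name tLine, the regex and the function name blank_outside_content are assumptions. Every character outside a content value is overwritten with a fill character, so the text keeps its length and each remaining word keeps its original offset.

```python
import re

# Hypothetical pattern: captures the value of every content="..." attribute.
CONTENT_VALUE = re.compile(r'content="([^"]*)"')


def blank_outside_content(raw, fill=" "):
    """Return a string of the same length as raw in which every character
    outside a content="..." value is replaced by the single character fill."""
    if len(fill) != 1:
        raise ValueError("fill must be exactly one character")
    out = [fill] * len(raw)
    for match in CONTENT_VALUE.finditer(raw):
        start, end = match.span(1)
        # Copy the attribute value back at its original position.
        out[start:end] = raw[start:end]
    return "".join(out)


raw = '<tLine aa="bb" cc="dd" content="the content to search in" ee="ff" />'
masked = blank_outside_content(raw)

# An offset found in the masked text is valid in the raw text unchanged.
hit = masked.find("search")
same_word_in_raw = raw[hit:hit + len("search")]
```

Because the masked text and the raw text have identical lengths, no offset correction is needed at all; the cost is that the field to analyze is as long as the raw data.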
Re: Problem with PatternReplaceCharFilter
Thanks again for your input. In fact I already preprocess the data (concatenating only the content values) and index it into another field. But my general problem is the following: my data has a rather cryptic format and I have to search only within the content values. Therefore I preprocess it and put it into a field. There everything works fine (highlighting etc.). The problem is that when I get a hit in that field, I need to know the XML element it appeared in so that I can read its attribute values. They define some rules for processing the search result, but it should not be possible to search in them. Therefore I cannot just use the HtmlStripCharFilter.

So my idea was the following: index both my cleaned version and the raw format, and make sure that both fields generate the same tokens (this is the hard part). If I need to know the surrounding attribute values, I search in the raw version and highlight the matching term. That tells me which attribute values to use. Another option would be to search in the cleaned version and, after the search, try in my application to match that position to the one in the raw format based on the highlighted term. But this is very error-prone. Neither solution seems elegant to me. Any suggestions?
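The second option could perhaps be made less error-prone by recording the mapping while the cleaned text is built, rather than reconstructing it afterwards from the highlighted term. The Python sketch below is an illustration under several assumptions: the raw data consists of self-closing elements such as tLine with double-quoted attributes, the application can obtain the character offset of a hit in the cleaned text, and the names build_index and attributes_at are made up for the example.

```python
import bisect
import re

# Hypothetical patterns for elements like <tLine aa="bb" content="..." />.
ELEMENT = re.compile(r'<(\w+)\s+([^<>]*?)/?>')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def build_index(raw, sep=" "):
    """Build the cleaned text plus a lookup table for its content runs.

    Returns (cleaned, starts, attrs_by_run):
      starts[i]       is the offset in cleaned where run i begins,
      attrs_by_run[i] holds the other attributes of the element that
                      contributed run i (never searchable themselves).
    """
    parts, starts, attrs_by_run = [], [], []
    pos = 0
    for element in ELEMENT.finditer(raw):
        attrs = dict(ATTRIBUTE.findall(element.group(2)))
        content = attrs.pop("content", None)
        if content is None:
            continue  # element without searchable text
        if parts:
            pos += len(sep)
        starts.append(pos)
        attrs_by_run.append(attrs)
        parts.append(content)
        pos += len(content)
    return sep.join(parts), starts, attrs_by_run


def attributes_at(offset, starts, attrs_by_run):
    """Return the attributes of the element whose content covers offset."""
    if not starts or offset < 0:
        raise ValueError("offset is outside the cleaned text")
    return attrs_by_run[bisect.bisect_right(starts, offset) - 1]


raw = ('<tLine aa="bb" cc="dd" content="the content to search in" ee="ff" />\n'
       '<tLine aa="xx" content="the second content line" ee="yy" />')
cleaned, starts, attrs_by_run = build_index(raw)

# Given the offset of a hit in the cleaned field, look up the attributes.
hit_offset = cleaned.index("second")
hit_attributes = attributes_at(hit_offset, starts, attrs_by_run)
```

The table of run starts is small compared with the raw data, so storing it next to the cleaned field might avoid having to index and store the raw format a second time, although whether that fits the rest of the setup is an open question.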