Re: Is payload the right solution for my problem?

2013-05-17 Thread jasimop
I think I just found the solution.

Would the right strategy be to store the original XML content and then use a
solr.HTMLStripCharFilterFactory when querying? I just ran a quick test and
it works;
the only problem now is that it also finds the data contained in the XML
attribute fields.

I think I will put my data into two fields, one containing only the raw data
without XML, and one
in the original format. Then I search in the raw field but return the
original format with the response.
The only problem I see here is that I need twice the disk space.
Is there a better solution?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-payload-the-right-solution-for-my-problem-tp4063814p4064117.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Is payload the right solution for my problem?

2013-05-19 Thread jasimop
I did some experiments, but I think I will end up with double the disk space.

The problem is the following: I will search in the full text (without the XML
content), but I need to know the
position of the search result both in the full text (to display it) and in
the XML data (to get the attributes associated
with the result term).
I tried to solve this with highlighting. As my experiments show, to
use highlighting on both fields,
both have to be indexed and stored, so I end up with nearly
double the disk space of my original data.

Does Solr provide any other options for such a problem?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-payload-the-right-solution-for-my-problem-tp4063814p4064482.html


Problem with PatternReplaceCharFilter

2013-05-29 Thread jasimop
Hi,

I have a problem using PatternReplaceCharFilter when indexing a field.
I created the following field:

  

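[The field definition (three PatternReplaceCharFilter regexes followed by a
StandardTokenizer) was stripped by the list archive.]
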

And I created a field that is indexed and stored:


I need to index a document with such a structure in this field:


Basically I have some sort of XML structure. I only need to search in the
"content" attribute, but when highlighting I need to get back to the
enclosing XML tags.

So with the three regexes I want to remove all unwanted tags and
tokenize/index only the important data.
I know that I could use HTMLStripCharFilterFactory, but then the tag
names, attribute names and values also get indexed, and I don't want to
search in that content.

I read the following in the doc:
NOTE: If you produce a phrase that has different length to source string and
the field is used for highlighting for a term of the phrase, you will face a
trouble. 

Why is this the case? When I run the analysis from the Solr admin page,
the CharFilters generate
"the content to search in the second content line", which looks perfect, but
then the StandardTokenizer
gets the start and end offsets of the tokens wrong.
Is there another solution to my problem?
Could I use the following method I saw in the doc of
PatternReplaceCharFilter:
protected int correct(int currentOff) Documentation: Retrieve the corrected
offset.

How could I solve such a task?
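
To illustrate why the offsets drift, here is a minimal sketch in plain Python
(not Lucene code). It deletes the markup with a regex, as the char filters do,
and records for every surviving character where it came from. The element name
"line", the sample string and the helper names are assumptions made up for the
example. Mapping token offsets back through such a table is presumably the job
of the correct(int currentOff) method quoted above.

```python
import re

# Hypothetical sample; the element name "line" is an assumption.
RAW = '<line aa="bb" cc="dd" content="the content" ee="ff" />'
# Everything except the value of the content attribute counts as markup.
MARKUP = r'<line[^>]*?content="|"[^>]*/>'


def strip_with_origin(raw, pattern):
    """Delete every match of pattern. Return the cleaned text and, for each
    cleaned character, the index that character had in the raw text."""
    parts, origin, pos = [], [], 0
    for m in re.finditer(pattern, raw):
        parts.append(raw[pos:m.start()])
        origin.extend(range(pos, m.start()))
        pos = m.end()
    parts.append(raw[pos:])
    origin.extend(range(pos, len(raw)))
    return "".join(parts), origin


def to_raw_span(origin, start, end):
    """Map a [start, end) span in the cleaned text back to the raw text."""
    return origin[start], origin[end - 1] + 1


cleaned, origin = strip_with_origin(RAW, MARKUP)

# A tokenizer only sees the cleaned text, so its offsets refer to that text.
start = cleaned.index("content")
end = start + len("content")

naive = RAW[start:end]                      # uncorrected: cuts into the markup
raw_start, raw_end = to_raw_span(origin, start, end)
fixed = RAW[raw_start:raw_end]              # corrected: the term in the raw text
```

Highlighting with the uncorrected offsets slices the stored raw value in the
wrong place, which matches the broken fragments I get back.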






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869.html


Re: Problem with PatternReplaceCharFilter

2013-05-29 Thread jasimop
Honestly, I have no idea how to do that.
PatternReplaceCharFilter doesn't seem to have a parameter like
preservePositions="true" and
optionally fillCharacter=" ".
And I don't think I can express this simply as a regex. How would I compute,
in a pure
regex, the length difference before and after the match?
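
Outside of Solr the idea is easy to sketch, because the replacement can be
computed per match instead of being a fixed string. The following Python
example is only an illustration of what a position-preserving replacement
would do. The element name "line" and the helper blank_out are made up for the
example, and nothing here is a Solr or Lucene API.

```python
import re

# Hypothetical sample; the element name "line" is an assumption.
RAW = '<line aa="bb" cc="dd" content="the content to search in" ee="ff" />'
# Everything except the value of the content attribute counts as markup.
MARKUP = r'<line[^>]*?content="|"[^>]*/>'


def blank_out(raw, pattern, fill=" "):
    """Replace every match with fill characters of the same length, so each
    remaining character keeps its original offset."""
    return re.sub(pattern, lambda m: fill * len(m.group(0)), raw)


padded = blank_out(RAW, MARKUP)

# Whitespace tokenization on the padded text yields offsets that are valid
# in the raw text as well.
tokens = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", padded)]
```

Because the padded text has the same length as the raw text, a highlighter
that wraps a token at its start and end offsets would cut the stored raw value
at the right characters.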

Well, the specific problem is that when highlighting, the term offsets are
wrong and the
result is not a valid XML structure that I can handle.
I expect something like
search in" ee="ff" />
but I can 
tLineaa="bb" cc="dd" content="the content to search
in" ee="ff" />

Thanks for your help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869p4066939.html


Re: Problem with PatternReplaceCharFilter

2013-05-31 Thread jasimop
Thanks again for your input.

In fact I already preprocess the data (concatenation of only the content
values) and index it into another field.

But my general problem is the following: My data has such a cryptic format
and I have to search only within the content values. Therefore I preprocess
it and put it into a field. There everything works fine (highlighting etc.).
The problem now comes from the fact that when I get a hit in that field I
need to know the XML element
it appeared in, so that I can get the attribute values. They define some
rules for processing the search result, but it should not be possible to
search in them. Therefore I cannot just use the HTMLStripCharFilter.

So my idea was the following: index my cleaned version and the raw format,
and make sure that both fields
generate the same tokens (this is the hard part). If I need to know the
surrounding attribute values, I search
in the raw version and highlight the matching term. That tells me
which attribute values to use.

Another option would be to search in the cleaned version and, after the
search, try in my application to match that position to the one in the raw
format based on the highlighted term. But this is very error-prone.
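
A minimal sketch of that second option in Python, assuming the cleaned field
is built by joining the content values with single spaces and that the
application keeps the start offset of every value. The element names "page"
and "line" and the helper names are assumptions for the example.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are assumptions.
RAW = (
    '<page>'
    '<line aa="bb" cc="dd" content="the content to search in" ee="ff" />'
    '<line aa="xx" cc="yy" content="the second content line" ee="zz" />'
    '</page>'
)


def build_clean_field(raw_xml, attr="content"):
    """Join all attr values with spaces and remember, for each value, the
    offset where it starts and the element it came from."""
    starts, elements, parts, pos = [], [], [], 0
    for el in ET.fromstring(raw_xml).iter():
        value = el.get(attr)
        if value is None:
            continue
        starts.append(pos)
        elements.append(el)
        parts.append(value)
        pos += len(value) + 1  # +1 for the separating space
    return " ".join(parts), starts, elements


def element_at(offset, starts, elements):
    """Return the element whose value covers the given cleaned-text offset."""
    return elements[bisect.bisect_right(starts, offset) - 1]


cleaned, starts, elements = build_clean_field(RAW)

# Pretend the highlighter reported a hit on "second" at this offset.
hit = cleaned.index("second")
owner = element_at(hit, starts, elements)
print(owner.attrib["aa"], owner.attrib["ee"])
```

The lookup only works if the offsets are built from exactly the same string
that is sent to the cleaned field, which is why I call this approach
error-prone.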

Both solutions do not seem elegant to me.


Any suggestions?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869p4067265.html