Yes, using PatternTokenizerFactory. Here's an example field type: if you define a "department" field with this type and do a copyField from "url" to "department", the field will end up with the department name alone. It handles embedded punctuation (e.g., dot, dash, and underscore) and mixed-case words (it breaks them into separate words). It is "text" rather than "string", so you can search on individual name words or a phrase. It also lower-cases the name, but you can skip that step by removing the LowerCaseFilterFactory filter.

<fieldType name="pat_url_department_text" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="://[^/]*/([^/]*)/" group="1"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
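
For completeness, the field and copyField declarations that go with this type might look roughly like the sketch below. The field names "url" and "department" come from the question; the attribute choices (indexed="true", stored="false") are my assumptions, not something tested against your schema. Keep in mind that analysis only affects the indexed terms: a stored copy of "department" would hold the original URL text that copyField passed in, which is why the sketch leaves it unstored.

```xml
<!-- Sketch only: attribute values here are assumptions, adjust to your schema. -->
<!-- "department" uses the field type defined above; it is indexed for search   -->
<!-- but not stored, since a stored value would be the raw URL, not the token.  -->
<field name="department" type="pat_url_department_text" indexed="true" stored="false"/>

<!-- Copy the raw "url" value into "department" at index time; the analyzer     -->
<!-- of the destination field then extracts the first path segment.             -->
<copyField source="url" dest="department"/>
```

With the URL from the question, a query such as department:department should then match the document, since the first path segment is what gets indexed (lower-cased).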






-- Jack Krupansky
-----Original Message-----
From: AlessandroF
Sent: Wednesday, June 06, 2012 2:57 AM
To: solr-user@lucene.apache.org
Subject: Extract information from url field

Hi All,
I would like to know if it's possible to set up a field where Solr, after
a document is posted, automatically extracts part of the content, as the
result of a regexp, into a field.

e.g.

Having a URL field containing
http://www.myCompany.Com/Department/Service/index.html
configured as <field name="url" type="url" stored="true" indexed="true"
required="true"/>

after posting, it should be split like:

<doc>
....
<str name="url">http://www.myCompany.Com/Department/Service/index.html</str>
<str name="department">Department</str>
....
</doc>

Thanks for helping!

Alessandro





--
View this message in context: http://lucene.472066.n3.nabble.com/Extract-information-from-url-field-tp3987913.html
Sent from the Solr - User mailing list archive at Nabble.com.
