Re: Extract information from url field

Jack Krupansky Wed, 06 Jun 2012 13:33:34 -0700

Yes, using PatternTokenizerFactory. Here's an example field type: if you define a "department" field with this type and do a copyField from "url" to "department", the field will be indexed with the department name alone. It handles embedded punctuation (e.g., dot, dash, and underscore) and mixed-case words (it breaks them into separate words). It is "text" rather than "string", so you can search on individual name words or a phrase. It also lower-cases the name, but you can skip that step by removing the LowerCaseFilterFactory filter.

<fieldType name="pat_url_department_text" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="://[^/]*/([^/]*)/" group="1"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
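
For completeness, here is a minimal sketch of the field and copyField declarations that would go with the type above in schema.xml. The field and type names come from this thread; the indexed/stored settings are my assumptions, so adjust them to your schema.

```xml
<!-- Sketch only: indexed/stored values below are assumptions, not from the thread. -->
<!-- "department" is analyzed by pat_url_department_text at index time. -->
<field name="department" type="pat_url_department_text" indexed="true" stored="false"/>

<!-- Copy the raw "url" value into "department" when a document is posted. -->
<copyField source="url" dest="department"/>
```

With the example URL http://www.myCompany.Com/Department/Service/index.html, the tokenizer should emit the single token "Department", which is then lower-cased, so a query such as department:department would match. Note that copyField copies the original text and the analyzer only affects what is indexed, so if you set stored="true" the stored value would be the full URL rather than the department name.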







-- Jack Krupansky

-----Original Message-----
From: AlessandroF
Sent: Wednesday, June 06, 2012 2:57 AM
To: solr-user@lucene.apache.org
Subject: Extract information from url field

Hi All,
I would like to know if it's possible to set up a field where Solr, after
a document is posted, automatically extracts part of the content into
another field using a regexp.

e.g.

Given a URL field containing
http://www.myCompany.Com/Department/Service/index.html
configured as <field name="url" type="url" stored="true" indexed="true"
required="true"/>

after posting, it should be split like this:

<doc>
....
<str name="url">http://www.myCompany.Com/Department/Service/index.html</str>
<str name="department">Department</str>
....
</doc>

Thanks for helping!

Alessandro





--

View this message in context: http://lucene.472066.n3.nabble.com/Extract-information-from-url-field-tp3987913.html
Sent from the Solr - User mailing list archive at Nabble.com.
