Hi, Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法?
Regards,
Edwin

On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Emir,
>
> An example of the string in Chinese is 预支款管理及账务处理办法
>
> The number of characters is 12, but the expected length should be 36.
>
> Regards,
> Edwin
>
> On 4 January 2018 at 16:21, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>
>> Hi Edwin,
>> I don’t have enough knowledge of Eastern languages to know what number
>> is expected when you ask for string length. Maybe you can try some of
>> the regex Unicode settings and see if you’ll get what you need: try
>> setting the Unicode flag with (?U), or try using regex groups and
>> ranges. If you provide an example string and the expected length, maybe
>> we could provide you a regex.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>> > On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >
>> > Hi Emir,
>> >
>> > So this would likely be different from what the operating system
>> > counts, as the operating system may consider each Chinese character
>> > as 3 to 4 bytes. That is probably why I could not find any record
>> > with subject:/.{255,}.*/
>> >
>> > Are there other tools that we can use to query the length of data
>> > that is already indexed and is not in standard English? (E.g.
>> > Chinese, Japanese, etc.)
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 3 January 2018 at 23:51, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>> >
>> >> Hi Edwin,
>> >> I do not know, but my guess would be that each character is counted
>> >> as 1 in regex regardless of how many bytes it takes in the encoding
>> >> used.
>> >>
>> >> Regards,
>> >> Emir
>> >> --
>> >> Monitoring - Log Management - Alerting - Anomaly Detection
>> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >>
>> >>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> I am doing the search on existing data that has already been
>> >>> indexed, and it is likely to be a one-time thing.
>> >>>
>> >>> This subject:/.{255,}.*/ works for English characters. However,
>> >>> there are Chinese characters in some of the records. The length
>> >>> seems to be more than 255, but they do not show up in the results.
>> >>>
>> >>> Do you know how the length for Chinese characters and other
>> >>> languages is determined?
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >>>
>> >>>> Do that during indexing as Emir suggested. Specifically, use an
>> >>>> UpdateRequestProcessor chain, probably with the Clone and
>> >>>> FieldLength processors:
>> >>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>> >>>>
>> >>>> Regards,
>> >>>> Alex.
>> >>>>
>> >>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I would like to check whether it is possible to query a field
>> >>>>> whose data is longer than a certain length.
>> >>>>>
>> >>>>> For example, I want to query the field subject for values of more
>> >>>>> than 255 bytes. Is it possible?
>> >>>>>
>> >>>>> I am currently using Solr 6.5.1.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Edwin
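
For reference, the 12 versus 36 discrepancy discussed in this thread can be reproduced outside Solr. The following is only a minimal Python sketch, assuming the "expected length" of 36 means UTF-8 encoded bytes; the helper names (char_length, utf8_byte_length, longer_than) are hypothetical and are not part of Solr or of anything proposed above.

```python
# Sketch: character count vs. UTF-8 byte count for a subject string.
# Assumption: the "expected length" in the thread is the UTF-8 byte count.
# All function names here are illustrative, not part of any Solr API.

def char_length(text: str) -> int:
    """Return the number of Unicode code points in text."""
    return len(text)


def utf8_byte_length(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def longer_than(subjects, max_bytes=255):
    """Return the subjects whose UTF-8 encoding exceeds max_bytes bytes."""
    return [s for s in subjects if utf8_byte_length(s) > max_bytes]


subject = "预支款管理及账务处理办法"
print(char_length(subject))       # 12
print(utf8_byte_length(subject))  # 36
```

Each character in this example takes 3 bytes in UTF-8, so a subject made only of such characters would pass 255 bytes at 86 characters. If the regex counts characters, as Emir guesses, that would explain why subject:/.{255,}.*/ misses such records. For a one-time check, filtering exported subject values with something like longer_than may be simpler than a regex.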