Yes, I've managed to "steal" some code from post.jar so that only rich-text document formats are sent to /update/extract.
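For anyone finding this thread later, here is a minimal sketch of the kind of filtering I mean. It is not the actual code from SimplePostTool: the class name RichDocFinder, the method names and the extension list are all my own assumptions for illustration, so adjust them to whatever formats you want to send to /update/extract.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of SolrJ or post.jar.
class RichDocFinder {

    // Assumed list of "rich" extensions; NOT copied from SimplePostTool.
    static final Set<String> RICH_EXTENSIONS = new HashSet<>(Arrays.asList(
            "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "rtf", "odt"));

    // True if the file name ends in one of the assumed extensions (case-insensitive).
    static boolean isRichDocument(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return false; // no extension, or name ends with a dot
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return RICH_EXTENSIONS.contains(ext);
    }

    // Recursively collects matching files under dir.
    static List<File> collect(File dir) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory()) {
                result.addAll(collect(child));
            } else if (isRichDocument(child)) {
                result.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collect(root)) {
            // In the real indexer, this is where the file would be posted with SolrJ.
            System.out.println("Would send to /update/extract: " + f.getPath());
        }
    }
}
```

Matching on the extension is the simplest approach and is easy to extend; it does not inspect the file contents, so a wrongly named file will still be picked up or skipped based on its name alone.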
I've also changed the Eclipse setting at Window -> Preferences -> General -> Workspace: under "Text file encoding", select "Other" and choose UTF-8. Eclipse is now able to read the Chinese characters successfully.

Thank you for your help.

Regards,
Edwin

On 19 October 2015 at 16:33, Duck Geraint (ext) GBJH <geraint.d...@syngenta.com> wrote:

> "The problem for this is that it is indexing all the files regardless of
> the formats, instead of just those formats in post.jar. So I guess still
> have to "steal" some codes from there to detect the file format?"
>
> If you've not worked it out yourself yet, try something like:
>
> http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
>
> http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
>
> Geraint
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.d...@syngenta.com
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: 17 October 2015 00:55
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> Thanks for your advice. I also found this method which so far has been
> able to traverse all the documents in the folder and index them in Solr.
>
> public static void showFiles(File[] files) {
>     for (File file : files) {
>         if (file.isDirectory()) {
>             System.out.println("Directory: " + file.getName());
>             showFiles(file.listFiles()); // Calls same method again.
>         } else {
>             System.out.println("File: " + file.getName());
>         }
>     }
> }
>
> The problem for this is that it is indexing all the files regardless of
> the formats, instead of just those formats in post.jar. So I guess still
> have to "steal" some codes from there to detect the file format?
>
> As for files that contain non-English characters (e.g. Chinese
> characters), it is currently not able to read the Chinese characters, and
> it is all read as a series of "???". Any idea how to solve this problem?
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
> geraint.d...@syngenta.com> wrote:
>
> > Also, check this link for SolrJ example code (including the recursion):
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Geraint
> >
> >
> > Geraint Duck
> > Data Scientist
> > Toxicology and Health Sciences
> > Syngenta UK
> > Email: geraint.d...@syngenta.com
> >
> > -----Original Message-----
> > From: Jan Høydahl [mailto:jan....@cominvent.com]
> > Sent: 16 October 2015 12:14
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recursively scan documents for indexing in a folder in
> > SolrJ
> >
> > SolrJ does not have any file crawler built in.
> > But you are free to steal code from SimplePostTool.java related to
> > directory traversal, and then index each document found using SolrJ.
> >
> > Note that SimplePostTool.java tries to be smart with what endpoint to
> > post files to, xml, csv and json content will be posted to /update
> > while office docs go to /update/extract
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > On 16 Oct 2015 at 05:22, Zheng Lin Edwin Yeo
> > > <edwinye...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I understand that in SimplePostTool (post.jar), there is this
> > > command to automatically detect content types in a folder, and
> > > recursively scan it for documents for indexing into a collection:
> > > bin/post -c gettingstarted afolder/
> > >
> > > This has been useful for me to do mass indexing of all the files
> > > that are in the folder.
> > > Now that I'm moving to production, I plan
> > > to use SolrJ to do the indexing, as it can do more things like
> > > robustness checks and retries for indexes that fail.
> > >
> > > However, I can't seem to find a way to do the same in SolrJ. Is it
> > > possible for this to be done in SolrJ? I'm using Solr 5.3.0
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> >
> > ________________________________
> >
> > Syngenta Limited, Registered in England No 2710846; Registered Office:
> > Syngenta Limited, European Regional Centre, Priestley Road, Surrey
> > Research Park, Guildford, Surrey, GU2 7YH, United Kingdom
> > ________________________________
> > This message may contain
> > confidential information. If you are not the designated recipient,
> > please notify the sender immediately, and delete the original and any
> > copies. Any use of the message by you is prohibited.