This is a good start. A few things to consider:

1. Extract the contents via Tika externally or via Tika Server.
2. Create a canonical "Item" document schema with fields such as title, metadata, contents, and possibly an imagePreview.
3. Use the extracted Tika data to populate your index (a rough sketch follows below).
4. Unless you need highlighting, only index the actual contents, and store the rest of the fields.
5. Shared file storage is probably OK, but you may want to add a caching layer via Nginx later and serve the files through it. That way you don't hit the disk every time.
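To make steps 1-3 concrete, here is a rough, untested sketch of what the extract-and-index loop could look like in Python. It assumes a Tika Server running on localhost:9998 and a Solr core named "items" on localhost:8983; the URLs, core name, field names, and the example file path are all placeholders you would adapt to your own schema and environment:

    import requests

    TIKA_URL = "http://localhost:9998/tika"
    SOLR_UPDATE_URL = "http://localhost:8983/solr/items/update/json/docs?commit=true"

    def extract_text(path):
        # Send the raw file to Tika Server; with Accept: text/plain it
        # returns the extracted plain text of the document.
        with open(path, "rb") as f:
            resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
        resp.raise_for_status()
        return resp.text

    def index_item(item_id, title, path):
        # Build one canonical "Item" document and post it to Solr.
        doc = {
            "id": item_id,
            "title": title,                   # stored, shown in the search UI
            "path": path,                     # stored, so the UI can link to the file
            "contents": extract_text(path),   # indexed for full-text search
        }
        resp = requests.post(SOLR_UPDATE_URL, json=doc)
        resp.raise_for_status()

    # Hypothetical example: walk your shared path and call this per file.
    index_item("doc-001", "Quarterly report", r"\\fileserver\docs\report.pdf")

In the Solr schema you would then mark "contents" as indexed but not stored (per point 4), and keep title, path, and the other metadata fields stored for display.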
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On May 12, 2018, 10:54 AM -0400, NetUser MSUser <msusernetarchit...@gmail.com>, wrote:
> Hi team,
>
> We have a business case like the one below.
>
> There are nearly 150 GB of docs (pdf/ppt/word/xl/msg) which are currently stored in a network path. To implement text search on these, we are planning to use Solr. Listed below is the plan.
>
> 1) Use a high-configuration Windows server (16 GB RAM, 1 TB disk, etc.).
> 2) Keep all the files on this server.
> 3) Index all the above docs into Solr (installed on the same Windows server). We will use the Solr post command to post documents to it.
> 4) Through a web application, users can further add or remove files to/from the shared path on this server.
> 5) A web UI to search the text from these docs and display the file names. Users can click and download the files.
>
> These are the questions we have:
>
> 1) Since we cannot index specific fields here (the search is across all text in docs of various types; a user can search for any text and it might be in an XL, DOC, PPT, or .MSG file), will querying the search data (via the REST API from the web) have any performance hit?
>
> 2) Is it the right decision to keep the physical files in the shared folder of the server itself (as a shared drive) instead of storing them in a DB or other storage?
>
> Regards,
> MS