I had partial success with the following approach. I mirrored the wiki with wget:

wget --recursive --page-requisites --no-clobber --html-extension --convert-links --restrict-file-names=windows --domains wiki.apache.org http://wiki.apache.org/solr/ -w 10 -l 5

and then configured a web server to serve that location.
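Any web server pointed at the directory wget creates (wiki.apache.org/) should work; for a quick local test, something as simple as Python's built-in server would do (just an example, not necessarily the server I used):

cd wiki.apache.org
python -m SimpleHTTPServer 8000   # Python 2; with Python 3: python3 -m http.server 8000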
Then I indexed it with:

java -Ddata=web -Ddelay=0 -Drecursive=5 -jar post.jar http://<solr_url>/solr/

I still have some details to work out, like deciding exactly how to index the files - maybe a dynamic field that is both indexed and stored (see the sketch below). I can then grep the resulting HTML to point the search box back at the Solr instance. I haven't tested all of it yet; I feel it might work, but it seems a little ugly.
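For the dynamic field part, I'm picturing something along these lines in schema.xml (an untested sketch - the field name suffix is made up, and it assumes the text_general type from the Solr 4.x example schema):

<dynamicField name="*_wiki_txt" type="text_general" indexed="true" stored="true"/>

That way the crawled pages are searchable and also stored, so /browse can display them.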
On Tue, Jan 1, 2013 at 11:32 AM, Upayavira <u...@odoko.co.uk> wrote:
> I have permission to provide an export. Right now I'm thinking of it
> being a one off dump, without the user dir. If someone wants to research
> how to make moin automate it, I at least promise to listen.
>
> Upayavira
>
> On Tue, Jan 1, 2013, at 08:10 AM, Alexandre Rafalovitch wrote:
>> That's why I think this could be a nice joint project with Apache Infra.
>> They provide the Moin export, we build a way to index it with Solr for
>> local usage. Start with our own - Solr - project, then sell it to others
>> once it has been dog-fooded enough. Instant increased Solr exposure to
>> all Apache project users.....
>>
>> Just a thought.
>>
>> Regards,
>>    Alex.
>>
>> Personal blog: http://blog.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>
>> On Tue, Jan 1, 2013 at 7:03 PM, Lance Norskog <goks...@gmail.com> wrote:
>>
>> > 3 problems:
>> > a- he wanted to read it locally.
>> > b- crawling the open web is imperfect.
>> > c- /browse needs to get at the files with the same URL as the uploader.
>> >
>> > a and b- Try downloading the whole thing with 'wget'. It has a 'make
>> > links point to the downloaded files' option. Wget is great.
>> >
>> > I have done this by parking my files behind a web server. You can use
>> > Tomcat. (I recommend the XAMPP distro:
>> > http://www.apachefriends.org/en/xampp.html). Then, use Erik's command
>> > to crawl that server. Use /browse to read it.
>> >
>> > Looking at this again, it should be possible to add a file system
>> > service to the Solr start.jar etc/jetty.xml file. I think I did this
>> > once. It would be a handy patch. In fact, this whole thing would make
>> > a great blog post.
>> >
>> > On 12/30/2012 05:05 AM, Erik Hatcher wrote:
>> >
>> >> Here's a geeky way to do it yourself:
>> >>
>> >> Fire up Solr 4.x, run this from example/exampledocs:
>> >>
>> >> java -Ddata=web -Ddelay=2 -Drecursive=1 -jar post.jar http://wiki.apache.org/solr/
>> >>
>> >> (although I do end up getting a bunch of 503's, so maybe this isn't
>> >> very reliable yet?)
>> >>
>> >> Tada:
>> >> http://localhost:8983/solr/collection1/browse
>> >>
>> >> :)
>> >>
>> >>         Erik
>> >>
>> >> On Dec 29, 2012, at 16:54 , d_k wrote:
>> >>
>> >>> Hello,
>> >>>
>> >>> I'm setting up Solr inside an intranet without internet access and
>> >>> I was wondering if there is a way to obtain the data dump of the Solr
>> >>> Wiki (http://wiki.apache.org/solr/) for offline viewing and searching.
>> >>>
>> >>> I understand MoinMoin has an export feature one can use
>> >>> (http://moinmo.in/MoinDump and
>> >>> http://moinmo.in/HelpOnMoinCommand/ExportDump)
>> >>> but I'm afraid it needs to be executed from within the MoinMoin server.
>> >>>
>> >>> Is there a way to obtain the result of that command?
>> >>> Is there another way to view the solr wiki offline?