Avoid wget appending index.html to links


I am trying to make a static HTML copy of a WordPress site that I can upload somewhere else, such as GitHub Pages.

I use this command:

Option 1:

wget -k -r -l 1000 -p -N -F -nH -P ./website http://example.com/website

It downloads the entire site, but the main issue is that it adds "index.html" to every single link. I understand this is needed to view the site locally, but it is not required on a static website host.

So is there a way to tell wget not to modify the links and add index.html to them?

For example, it creates:

<a href="blog/2015/07/11/hello-world/index.html">hello world!</a> 

for the default WordPress "Hello world!" post.

Option 2:

Use the mirroring command without -k (which converts links):

wget -E -m -p -F -nH -P ./website http://example.com/website

Then it does not add index.html, but it does retain the domain name.

But it also crawls http://example.com and indexes everything there. I do not want that. I want /website to be the root (because it is a WordPress multisite). How can I fix this?

I want to rewrite the hostname instead of stripping or keeping it. Links to http://example.com/website/ (the WordPress multisite) should go to http://example.org/. Is that possible, or do I need to run sed/awk on the files after the download?

I faced a similar problem and solved it by postprocessing with sed.

This replaces all occurrences of /index.html' with /'. As a comment above indicates, a redirect occurs anyway if the trailing slash is missing, so I added it =)

find ./ -type f -exec sed -i -e "s/\/index\.html'/\/\'/g" {} \; 

And this monster replaces all occurrences of "index.html" or 'index.html' (or "index.html' or 'index.html" ...) with ".":

find ./ -type f -exec sed -i -e "s/['\\\"]index\.html['\\\"]/\\\".\\\"/g" {} \; 
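Both expressions above hard-code the quote character they expect or emit, while the link wget produced in the question uses double quotes. As a rough sketch only, assuming a sed that supports -E (GNU and BSD sed both do), the closing quote can be captured and put back unchanged. The helper name strip_index_html is made up for this example and the sample link is the one from the question:

```shell
#!/bin/sh
# strip_index_html is a hypothetical helper name, not part of wget or sed.
# It reads HTML on stdin and drops "index.html" when it follows a slash and
# is directly followed by a closing quote, keeping whichever quote was used.
strip_index_html() {
    sed -E "s|/index\.html(['\"])|/\1|g"
}

# Sample link in the form shown in the question (double-quoted attribute).
sample='<a href="blog/2015/07/11/hello-world/index.html">hello world!</a>'
result=$(printf '%s\n' "$sample" | strip_index_html)
printf '%s\n' "$result"
# prints: <a href="blog/2015/07/11/hello-world/">hello world!</a>
```

This only touches links that end in /index.html. A bare href="index.html" has no leading slash, so it is left alone and still needs the second expression above.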

You can see what sed matches, e.g. on index.html, with this command:

sed -n "s/['\\\"]index\.html['\\\"]/'\/'/p" index.html 
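The same postprocessing approach could probably cover the hostname rewrite asked about in the question as well. The following is a minimal sketch, not something I ran against a real mirror. It assumes the pages were downloaded without -k, so that they still contain the absolute URLs. The helper name rewrite_base is made up for this example and the two URLs are the ones from the question:

```shell
#!/bin/sh
# rewrite_base is a hypothetical helper name. It reads HTML on stdin and
# replaces the old base URL with the new one (both taken from the question).
rewrite_base() {
    sed -e 's|http://example\.com/website/|http://example.org/|g'
}

# Sample absolute link as it would appear when links are not converted.
sample='<a href="http://example.com/website/blog/2015/07/11/hello-world/">hello world!</a>'
rewritten=$(printf '%s\n' "$sample" | rewrite_base)
printf '%s\n' "$rewritten"
# prints: <a href="http://example.org/blog/2015/07/11/hello-world/">hello world!</a>
```

To apply it to the whole download, the same expression can go into the find ... -exec sed -i -e ... {} \; form used above. Note that a link to http://example.com/website without the trailing slash is not matched by this pattern.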

Hope you find this useful.

