Avoid wget appending index.html to links
I am trying to make a static HTML copy of a WordPress site that I can upload somewhere else, such as GitHub Pages.
I use this command:
Option 1:
wget -k -r -l 1000 -p -N -F -nH -P ./website http://example.com/website
It downloads the entire site, etc. The main issue here is that it adds "index.html" to every single link. I understand this is needed to view the site locally, but it is not required on a static website host.
So is there a way to tell wget not to modify the links and add index.html to them?
For example, it creates:
<a href="blog/2015/07/11/hello-world/index.html">hello world!</a>
for the default WordPress "Hello world!" post.
Option 2:
Use mirroring without the -k (convert links) option:
wget -E -m -p -F -nH -P ./website http://example.com/website
Then it does not append index.html, but it retains the domain name in the links.
But it also crawls http://example.com and indexes everything there. I do not want that. I want /website to be the root (because it is a WordPress multisite). How can I fix this?
I want to rewrite the hostname instead of stripping or keeping it. Links should go from http://example.com/website/ (the WordPress multisite) to http://example.org/. Is that possible, or do I need to run sed/awk on the files after the download?
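For the crawl-scope part of the question, one thing that may be worth trying is wget's --no-parent option, which stops recursion from climbing above the starting directory, together with --cut-dirs=1, which drops the leading website/ directory from the saved paths. The sketch below only assembles and prints such a command; the URL and target directory are the example values from the question, and the trailing slash on the URL is an assumption so that wget treats /website/ as a directory.

```shell
# Example values from the question; replace them with the real site.
url="http://example.com/website/"
dest="./website"

# --no-parent : never ascend above /website/ while recursing
# --cut-dirs=1: save files under $dest/ instead of $dest/website/
cmd="wget -E -m -p -F -nH --no-parent --cut-dirs=1 -P $dest $url"

# Print the command for review instead of running it against a live site.
echo "$cmd"
```

Run the printed command by hand once it looks right. Whether this is enough for a given multisite setup is untested here.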
I faced a similar problem and solved it by postprocessing with sed.
This replaces all occurrences of /index.html' with /' (a comment above indicates that a redirect occurs anyway if the trailing slash is missing, so I added it =)
find ./ -type f -exec sed -i -e "s/\/index\.html'/\/\'/g" {} \;
And this monster replaces all occurrences of "index.html" or 'index.html' (or "index.html' or 'index.html" ...) with ".":
find ./ -type f -exec sed -i -e "s/['\\\"]index\.html['\\\"]/\\\".\\\"/g" {} \;
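To see the effect without touching a real mirror, here is a small sketch that runs on a scratch directory. It assumes GNU sed (for -i without a suffix). The first substitution is a variant, not the exact command above: it accepts either quote character after /index.html and puts the same quote back. The file names and links are made up for illustration.

```shell
# Scratch directory with a hypothetical page, so no real files are modified.
demo_dir=$(mktemp -d)
cat > "$demo_dir/page.html" <<'EOF'
<a href="blog/2015/07/11/hello-world/index.html">Hello world!</a>
<a href='about/index.html'>About</a>
<a href="index.html">Home</a>
EOF

# Drop a trailing index.html before either quote, keeping the slash and the quote.
find "$demo_dir" -type f -exec sed -i -e "s/\/index\.html\(['\"]\)/\/\1/g" {} \;

# Turn a bare quoted index.html into "." (same idea as the second command above).
find "$demo_dir" -type f -exec sed -i -e "s/['\"]index\.html['\"]/\".\"/g" {} \;

cat "$demo_dir/page.html"
```

After both passes the three links point to blog/2015/07/11/hello-world/, about/ and . respectively.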
You can see what sed is doing for matches, e.g. on index.html, with this command:
sed -n "s/['\\\"]index\.html['\\\"]/'\/'/p" index.html
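The hostname rewrite from the question can be handled with the same kind of postprocessing. This is only a sketch, assuming GNU sed and that the links in the downloaded files are absolute; old_base and new_base are the example URLs from the question, and the scratch file stands in for the real mirror.

```shell
# Scratch copy with one hypothetical link, so the sketch does not touch a real mirror.
site_dir=$(mktemp -d)
printf '%s\n' '<a href="http://example.com/website/blog/">Blog</a>' > "$site_dir/page.html"

# Example URLs from the question, not real hosts.
old_base='http://example.com/website/'
new_base='http://example.org/'

# Escape the dots so they match literally in the sed pattern.
old_pattern=$(printf '%s' "$old_base" | sed 's/\./\\./g')

# "|" as the delimiter avoids having to escape the slashes in the URLs.
find "$site_dir" -type f -name '*.html' -exec sed -i -e "s|$old_pattern|$new_base|g" {} \;

cat "$site_dir/page.html"
```

For a real mirror, point site_dir at the download directory instead of a temporary one, and widen the -name filter if CSS or JavaScript files also contain the old URL.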
Hope you find this useful.