The Cookie Machine - Click here to drag window

DUMMY TEXT - Real text set in assets/js/theCookieMachine.js

Views: 59,629β€…    Votes:  15β€…
Tags: html   bash   html-escape-characters  
Link: πŸ” See Original Answer on Stack Overflow πŸ”—

Title: Bash script to convert from HTML entities to characters
ID: /2017/03/28/Bash-script-to-convert-from-HTML-entities-to-characters
Created: March 28, 2017    Edited:  December 7, 2021
Upload: November 24, 2022    Layout:  post
TOC: false    Navigation:  false    Copy to clipboard:  false

This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/#&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'

Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers

Edit June 26, 2017

Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.

Here’s the function:

LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
⇧ System freezes completely with Intel Bay Trail writing a text file in the terminal with touch  β‡©