# Cleaning HTML exported from Word
## 1. HtmlRuleSanitizer (Nuget)
#### (Worth considering)
open-source (MIT)
Nuget: https://www.nuget.org/packages/Vereyon.Web.HtmlSanitizer
Github: https://github.com/Vereyon/HtmlRuleSanitizer
Usage:
```
using Vereyon.Web;
var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();
string cleanHtml = sanitizer.Sanitize(dirtyHtml);
```

Reference:
https://stackoverflow.com/questions/2806678/programmatically-clean-word-generated-html-while-preserving-styles
## 2. Cleaning Word's Nasty HTML (function)
#### (*Candidate)
Uses regular expressions only; no additional DLL is required.
```
using System;
using System.Collections.Specialized;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 0 || String.IsNullOrEmpty(args[0]))
        {
            Console.WriteLine("No filename provided.");
            return;
        }
        string filepath = args[0];
        // A bare file name is resolved against the current directory
        if (Path.GetFileName(filepath) == args[0])
        {
            filepath = Path.Combine(Environment.CurrentDirectory, filepath);
        }
        if (!File.Exists(filepath))
        {
            Console.WriteLine("File doesn't exist.");
            return;
        }
        string html = File.ReadAllText(filepath);
        Console.WriteLine("input html is " + html.Length + " chars");
        html = CleanWordHtml(html);
        html = FixEntities(html);
        filepath = Path.GetFileNameWithoutExtension(filepath) + ".modified.htm";
        File.WriteAllText(filepath, html);
        Console.WriteLine("cleaned html is " + html.Length + " chars");
    }

    static string CleanWordHtml(string html)
    {
        StringCollection sc = new StringCollection();
        // get rid of unnecessary tag spans (comments and title)
        sc.Add(@"<!--(\w|\W)+?-->");
        sc.Add(@"<title>(\w|\W)+?</title>");
        // Get rid of classes and styles
        sc.Add(@"\s?class=\w+");
        sc.Add(@"\s+style='[^']+'");
        // Get rid of unnecessary tags
        sc.Add(@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
        // Get rid of empty paragraph tags
        sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
        // remove bizarre v: element attached to <img> tag
        sc.Add(@"\s+v:\w+=""[^""]+""");
        // remove extra lines
        sc.Add(@"(\n\r){2,}");
        foreach (string s in sc)
        {
            html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
        }
        return html;
    }

    static string FixEntities(string html)
    {
        // Replace typographic characters with their HTML entities
        NameValueCollection nvc = new NameValueCollection();
        nvc.Add("“", "&ldquo;");
        nvc.Add("”", "&rdquo;");
        nvc.Add("–", "&mdash;");
        foreach (string key in nvc.Keys)
        {
            html = html.Replace(key, nvc[key]);
        }
        return html;
    }
}
```
Reference:
https://blog.codinghorror.com/cleaning-words-nasty-html
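
To get a feel for what this regex approach does, here is a minimal sketch that applies a few patterns of the same kind to a small, made-up Word fragment. The class `WordCleanupDemo`, the method `StripWordMarkup`, the shortened pattern list, and the sample input are all assumptions for illustration and are not taken from the article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCleanupDemo
{
    // Hypothetical subset of cleanup patterns; every match is deleted, in order.
    private static readonly string[] Patterns =
    {
        @"<!--(\w|\W)+?-->",     // HTML comments, including Word's conditional blocks
        @"\s?class=\w+",         // unquoted class attributes such as class=MsoNormal
        @"\s+style='[^']+'",     // single-quoted inline styles
        @"</?(o:|span)[^>]*?>",  // Office-namespace tags and <span> wrappers
    };

    public static string StripWordMarkup(string html)
    {
        foreach (string pattern in Patterns)
        {
            html = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
        }
        return html;
    }

    public static void Main()
    {
        // Made-up sample of the markup Word typically emits
        string dirty =
            "<p class=MsoNormal style='margin:0cm'><!--[if gte mso 9]>x<![endif]-->" +
            "<span lang=EN-US>Hello<o:p></o:p></span></p>";
        Console.WriteLine(StripWordMarkup(dirty)); // prints <p>Hello</p>
    }
}
```

Because every pattern is replaced with an empty string, the order matters only when one pattern's match overlaps another's; removing comments first keeps the later patterns from seeing markup inside them.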
## 3. Regular-expression function
See also: https://gist.github.com/SamWM/1716310
```
public string CleanHtml(string html)
{
    // Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
    // Only returns acceptable HTML, and converts line breaks to <br />
    // Acceptable HTML includes HTML-encoded entities.
    html = html.Replace("&" + "nbsp;", " ").Trim(); // concat here due to SO formatting
    // Does this have HTML tags?
    if (html.IndexOf("<") >= 0)
    {
        // Make all tags lowercase
        html = Regex.Replace(html, "<[^>]+>", delegate(Match m) {
            return m.ToString().ToLower();
        });
        // Filter out anything except allowed tags
        // Problem: this strips attributes, including href from a
        // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
        string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
        // The \b keeps a whitelisted name from matching as a prefix (e.g. "b" in <body>, "i" in <iframe>)
        string WhiteListPattern = "</?(?(?=(?:" + AcceptableTags + @")\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
        // Make all BR/br tags look the same, and trim them of whitespace before/after
        html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
    }
    // No CRs
    html = html.Replace("\r", "");
    // Convert remaining LFs to line breaks
    html = html.Replace("\n", "<br />");
    // Trim BRs at the end of any string, and spaces on either side
    return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
}
```
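
The whitelist filter relies on .NET's conditional regex construct `(?(condition)yes|no)`. A minimal sketch of that idea in isolation is shown here; the class `WhitelistDemo`, the method `KeepAllowedTags`, and the shortened tag list are assumptions for illustration, and the attribute handling is simplified to `[^>]*`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Hypothetical, shortened whitelist for the demo
    private const string AllowedTags = "b|i|p|br";

    // If the tag name is whitelisted, the conditional must match the literal
    // "notag", which fails, so the tag is left in place. Otherwise the tag name
    // and everything up to '>' is consumed and the whole tag is removed.
    private static readonly Regex DisallowedTag = new Regex(
        "</?(?(?=(?:" + AllowedTags + @")\b)notag|[a-zA-Z0-9]+)[^>]*>");

    public static string KeepAllowedTags(string html)
    {
        return DisallowedTag.Replace(html, "");
    }

    public static void Main()
    {
        string input = "<p><font color=\"red\"><b>Hi</b></font><body></p>";
        Console.WriteLine(KeepAllowedTags(input)); // prints <p><b>Hi</b></p>
    }
}
```

The word boundary after the tag list matters: without it, `b` would also match the start of `<body>` and that tag would slip through the filter.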
## 4. Aspose cannot clean Word HTML code
