# Cleaning HTML generated by Word

## 1. HtmlRuleSanitizer (Nuget)
#### (worth considering)
open-source (MIT)
Nuget: https://www.nuget.org/packages/Vereyon.Web.HtmlSanitizer
Github: https://github.com/Vereyon/HtmlRuleSanitizer

Usage:
```
using Vereyon.Web;

var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();
string cleanHtml = sanitizer.Sanitize(dirtyHtml);
```

Reference image:
![](https://i.imgur.com/7BGoPWK.png)

Reference article:
https://stackoverflow.com/questions/2806678/programmatically-clean-word-generated-html-while-preserving-styles

## 2. Cleaning Word's Nasty HTML (function)
#### (*candidate)
Uses regular expressions only; no additional DLLs are needed.
```
using System;
using System.Collections.Specialized;
using System.IO;
using System.Text.RegularExpressions;

static class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 0 || String.IsNullOrEmpty(args[0]))
        {
            Console.WriteLine("No filename provided.");
            return;
        }

        // Resolve a bare file name against the current directory
        string filepath = args[0];
        if (Path.GetFileName(filepath) == args[0])
        {
            filepath = Path.Combine(Environment.CurrentDirectory, filepath);
        }
        if (!File.Exists(filepath))
        {
            Console.WriteLine("File doesn't exist.");
            return;
        }

        string html = File.ReadAllText(filepath);
        Console.WriteLine("input html is " + html.Length + " chars");
        html = CleanWordHtml(html);
        html = FixEntities(html);
        filepath = Path.GetFileNameWithoutExtension(filepath) + ".modified.htm";
        File.WriteAllText(filepath, html);
        Console.WriteLine("cleaned html is " + html.Length + " chars");
    }

    static string CleanWordHtml(string html)
    {
        StringCollection sc = new StringCollection();
        // get rid of unnecessary tag spans (comments and title)
        sc.Add(@"<!--(\w|\W)+?-->");
        sc.Add(@"<title>(\w|\W)+?</title>");
        // Get rid of classes and styles
        sc.Add(@"\s?class=\w+");
        sc.Add(@"\s+style='[^']+'");
        // Get rid of unnecessary tags
        sc.Add(@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
        // Get rid of empty paragraph tags
        sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
        // remove bizarre v: element attached to <img> tag
        sc.Add(@"\s+v:\w+=""[^""]+""");
        // remove extra lines (matches \n\r pairs; Windows line endings are \r\n)
        sc.Add(@"(\n\r){2,}");
        foreach (string s in sc)
        {
            html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
        }
        return html;
    }

    static string FixEntities(string html)
    {
        // Keys are reconstructed from mis-encoded text in these notes:
        // curly quotes and an en dash, mapped to HTML entities.
        NameValueCollection nvc = new NameValueCollection();
        nvc.Add("\u201C", "&ldquo;"); // left curly quote
        nvc.Add("\u201D", "&rdquo;"); // right curly quote
        nvc.Add("\u2013", "&mdash;"); // en dash
        foreach (string key in nvc.Keys)
        {
            html = html.Replace(key, nvc[key]);
        }
        return html;
    }
}
```

Reference article:
https://blog.codinghorror.com/cleaning-words-nasty-html

## 3. Regular-expression function
See also: https://gist.github.com/SamWM/1716310
```
// Requires: using System.Text.RegularExpressions;
public string CleanHtml(string html)
{
    // Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
    // Only returns acceptable HTML, and converts line breaks to <br />
    // Acceptable HTML includes HTML-encoded entities.

    html = html.Replace("&" + "nbsp;", " ").Trim(); // concat here due to SO formatting

    // Does this have HTML tags?
    if (html.IndexOf("<") >= 0)
    {
        // Make all tags lowercase
        html = Regex.Replace(html, "<[^>]+>", delegate(Match m) { return m.ToString().ToLower(); });

        // Filter out anything except allowed tags
        // Problem: this strips attributes, including href from a
        // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
        string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
        // \b keeps the whitelist from matching on a prefix only (e.g. "i" in <iframe>)
        string WhiteListPattern = "</?(?(?=(?:" + AcceptableTags + @")\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);

        // Make all BR/br tags look the same, and trim them of whitespace before/after
        html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
    }

    // No CRs
    html = html.Replace("\r", "");
    // Convert remaining LFs to line breaks
    html = html.Replace("\n", "<br />");

    // Trim BRs at the end of any string, and spaces on either side
    return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
}
```

## 4. ASPOSE
Aspose cannot clean Word HTML code.
![](https://i.imgur.com/5A6kWce.png)
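
As a quick sanity check of the regex approach from section 2, the sketch below runs a small subset of those patterns over a hard-coded fragment. The class name `WordHtmlDemo`, the `Clean` helper, and the sample input are made up for illustration; real Word output is messier and may need the full pattern list.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordHtmlDemo
{
    // Subset of the section 2 patterns, applied in order.
    static readonly string[] Patterns =
    {
        @"<!--(\w|\W)+?-->",       // HTML comments, including conditional comments
        @"\s?class=\w+",           // unquoted class attributes such as class=MsoNormal
        @"\s+style='[^']+'",       // single-quoted inline styles
        @"<(/?o:|/?span)[^>]*?>",  // Office namespace tags and span tags
    };

    public static string Clean(string html)
    {
        foreach (string pattern in Patterns)
        {
            html = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
        }
        return html;
    }

    static void Main()
    {
        // Hypothetical sample input, not taken from a real Word export.
        string dirty =
            "<p class=MsoNormal style='margin:0cm'>" +
            "<span lang=EN-US>Hello<o:p></o:p></span></p>" +
            "<!--[if gte mso 9]>x<![endif]-->";

        Console.WriteLine(Clean(dirty)); // prints "<p>Hello</p>"
    }
}
```

Running a fragment like this before and after each pattern is added makes it easier to see which rule removes which piece of markup, and whether a rule is eating content that should be kept.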