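I know the data should be correct. I have no control over the data, and my boss will just tell me that I need to find a way to deal with someone else's mistake. So please don't tell me that bad data isn't my problem, because it is.

Anyhow, this is what I'm looking at: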

"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""

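The data has been scrubbed for confidentiality reasons.

As you can see, the data contains quotes, and some of the quoted fields contain commas, so I can't simply strip the quotes. But the "SUITE "A"" field is throwing off the parser. There are too many quotes. >.<

I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings: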

            parser.HasFieldsEnclosedInQuotes = true;
            parser.SetDelimiters(",");
            parser.TextFieldType = FieldType.Delimited;

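The error is

MalformedLineException: Line 9871 cannot be parsed using the current delimiters.

I'd like to scrub the data somehow to fix this, but I don't know how. Or maybe there is a way to skip this line? Although I doubt my higher-ups would approve of skipping data we might need.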

6 Answers

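If you just want to get rid of the stray " marks in your CSV, you can use the following regex to find them and replace them with '

using System;
using System.Text.RegularExpressions;

string sourcestring = "source string to match with pattern";
// Lookarounds never capture, so there is no group to reference in the replacement.
string matchpattern = @"(?<!^|,)""(?!,|$)";
string replacementpattern = "'";
Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementpattern, RegexOptions.Multiline));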

Explanation:

@"(?<!^|,)""(?!,|$)" will find any " that is not preceded by the start of the line or a , and is not followed by the end of the line or a ,
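As a quick check of that pattern against a field shaped like the one in the question, here is a minimal sketch. The class and method names (`StrayQuoteCleaner`, `Clean`) are made up for illustration, and it assumes each line is cleaned before it reaches the parser. A stray quote that happens to sit right next to a comma inside a field would not be caught by this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrayQuoteCleaner
{
    // A quote that is not at a field boundary: not preceded by line start or a comma,
    // and not followed by a comma or line end.
    private static readonly Regex StrayQuote =
        new Regex(@"(?<!^|,)""(?!,|$)", RegexOptions.Multiline);

    // Replaces every stray double quote with a single quote.
    public static string Clean(string line)
    {
        return StrayQuote.Replace(line, "'");
    }
}
```

For the input `"Address","SUITE "A"","City"` this returns `"Address","SUITE 'A'","City"`, while the empty `""` fields and `"LastName, MD"` are left alone because their quotes all touch a comma or a line boundary.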

answered 2016-08-29T20:43:30.507

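I'm not familiar with TextFieldParser. But with CsvHelper, you can add a custom handler for invalid data:

using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;

// These configuration members are the ones CsvHelper offered when this was written;
// other versions of the library may expose exception handling differently.
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback = (e, row) =>
{
    // you can add some custom patching here if possible
    // or, save the line numbers and add/edit them manually later.
};

using (var file = File.OpenText(".csv"))
using (var reader = new CsvReader(file, config))
{
    // GetRecords is lazy, so enumerate it while the reader is still open.
    var records = reader.GetRecords<YourDtoClass>().ToList();
}
answered 2016-08-29T20:12:17.747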

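My only addition to what everyone else has said (because we've all been there) is to try to correct, in code, each new kind of problem you run into. There are some decent regex strings out there https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you can use String.Replace (String.Replace ("\"\"\"","").Replace("\"\","").Replace("\",,","\",") etc.). Eventually, as you discover and find ways to correct more and more errors, your manual recovery rate will drop considerably (most of your bad data probably comes from similar errors). Cheers!

PS - Idea-ish (it's been a while, and the logic may need some tweaking because I'm writing from memory), but you'll get the gist: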

public string[] parseCSVWithQuotes(string csvLine, int expectedNumberOfDataPoints)
    {
        var ret = new System.Text.StringBuilder();
        char lastChar = '\0';
        bool needleDown = false;//true while inside single quotes; characters are then treated literally
        foreach (char thisChar in csvLine)
        {
            if (thisChar == '\'' && lastChar != '\'')
                needleDown = !needleDown;
            if (thisChar == ',' && lastChar != ',')
            {
                //inside single quotes: convert literal comma to pipe so it doesn't cause another break on split
                //outside single quotes: keep the comma, the break on split is intended
                ret.Append(needleDown ? '|' : ',');
                lastChar = thisChar;
                continue;//already emitted; without this the comma would be appended a second time below
            }
            if (!needleDown && (thisChar == '"' || thisChar == '*'))
            {
                //do not add -- this eliminates any undesired characters outside single quotes
                //(repeat for any undesired character or use RegEx)
            }
            else if ((lastChar == '\'' || lastChar == '"' || lastChar == ',') && thisChar == lastChar)
            {
                //do not add - this eliminates double characters
            }
            else
            {
                ret.Append(thisChar);//not an undesired character, not a double: valid
                lastChar = thisChar;
            }
        }
        //we've cleaned as best we can
        string[] parts = ret.ToString().Split(',');
        if (parts.Length != expectedNumberOfDataPoints)
        {
            //save ret to bad CSV log
            return null;
        }
        for (int i = 0; i < parts.Length; i++)
        {
            //go back and replace the temporary pipe with the literal comma AFTER split
            parts[i] = parts[i].Replace("|", ",");
        }
        return parts;
    }
answered 2016-08-29T20:22:36.577

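I've had to do this before.

The first step is to split the data using string.Split(',')

The next step is to merge the segments that belong together.

What I basically did was:

  • Create a new list representing the combined strings
  • If a string starts with a quote, push it onto the new list
  • If it doesn't start with a quote, append it to the last string in the list
  • Bonus: throw an exception when a string ends with a quote but the next string doesn't start with one

Depending on the rules about what can actually appear in your data, you may have to change the code to account for that.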
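The steps above might look roughly like this in C#. This is only a sketch under the assumption that every field in the file is quoted, as in the question's sample; the names `SplitAndMerge` and `Parse` are invented here. Stripping the outer quotes at the end is an extra step, not part of the list above. A comma that sits directly next to a quote inside a field would still be split in the wrong place.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndMerge
{
    public static List<string> Parse(string line)
    {
        var merged = new List<string>();
        foreach (string piece in line.Split(','))
        {
            // A piece that starts with a quote begins a new field.
            if (merged.Count == 0 || (piece.Length > 0 && piece[0] == '"'))
            {
                merged.Add(piece);
                continue;
            }
            string last = merged[merged.Count - 1];
            // Bonus rule: the previous field was already closed, so this piece is orphaned.
            if (last.Length >= 2 && last[last.Length - 1] == '"')
                throw new FormatException("Unquoted text after a closed field: " + piece);
            // Otherwise the comma was part of the data: glue the piece back on.
            merged[merged.Count - 1] = last + "," + piece;
        }
        // Strip the outer quotes; quotes inside the field are kept as data.
        for (int i = 0; i < merged.Count; i++)
        {
            string f = merged[i];
            if (f.Length >= 2 && f[0] == '"' && f[f.Length - 1] == '"')
                merged[i] = f.Substring(1, f.Length - 2);
        }
        return merged;
    }
}
```

Because `"SUITE "A""` contains no comma, it survives the split as a single piece and comes out as `SUITE "A"`.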

answered 2016-08-29T20:04:10.207

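At the core of the CSV file format, each line is a row, and each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotes do not count as separators but are part of the data. I say very unfortunate because a misplaced quote affects the whole rest of the line, and since quotes in standard ASCII do not distinguish between opening and closing, you really can't recover from it without knowing the original intent.

That said, if you log the failure somehow, whoever does know the original intent (the person who provided the data) can look at the file and correct the error: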

if (parse_line(line, &data)) {
   // save the data
} else {
   // log the error; stderr is already a FILE *, so it is passed without &
   fprintf(stderr, "Bad line: %s\n", line);
}

And since your quotes don't escape newlines, you can carry on with the next line after hitting this error.
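With the TextFieldParser settings from the question, the same log-and-continue idea could be sketched as follows. The wrapper names (`TolerantReader`, `Read`, `badLines`) are hypothetical; `ErrorLine` and `ErrorLineNumber` are the parser's own properties for the line that raised the most recent MalformedLineException.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TolerantReader
{
    // Returns every row that parses; rows that raise MalformedLineException
    // are recorded in badLines instead of aborting the whole run.
    public static List<string[]> Read(TextReader input, List<string> badLines)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                try
                {
                    rows.Add(parser.ReadFields());
                }
                catch (MalformedLineException)
                {
                    // Keep the raw text so someone who knows the intent can fix it later.
                    badLines.Add(parser.ErrorLineNumber + ": " + parser.ErrorLine);
                }
            }
        }
        return rows;
    }
}
```

The collected bad lines can then be written to a log or handed to a cleanup step, so nothing is silently dropped.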

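Addendum: if your company has the choice (i.e. your data is being serialized by a company tool), don't use CSV. Use something like XML or JSON that has a more clearly defined parsing mechanism.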

answered 2016-08-29T20:11:07.670

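I had to do this once too. My approach was to go through a line and keep track of what I was reading. Basically, I wrote my own scanner that chopped tokens off the input line, which gave me full control over my faulty .csv data.

This is what I did: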

For each character on a line of input:
 1. when outside of a string and meeting a comma => all of the preceding text (which can be empty) is a valid token.
 2. when outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
 3. when outside of a string and meeting a quote => found the start of a string.
 4. when inside of a string and meeting a comma => accept the comma as part of the string.
 5. when inside of a string and meeting a quote => trouble starts here, mark this point.
   6. continue, and when meeting a comma (skipping whitespace if desired) close the string, 'unread' the comma and continue. (That will bring you to point 1.)
   7. or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue. (That will bring you to point 5.)
   8. or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote; accept the string as a value.
   9. or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.

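If the number of fields in your .csv file is fixed, you can count the commas you recognized as field separators, and when you see the end of the line you know whether you have another problem.

From the stream of strings extracted from the input line you can build a "clean" .csv line, and in that way build up a buffer of accepted, cleaned input that your existing code can consume.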
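A compact C# sketch of such a scanner is shown below. It is a simplification of the numbered rules rather than a literal transcription. A quote inside a string counts as the closing quote only when the next non-whitespace character is a comma or the end of the line, and any other quote is kept as data. Unquoted text (rule 2) is simply accepted. The names `QuoteScanner`, `Scan`, and `ToCsvLine` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class QuoteScanner
{
    // Splits one line into fields and checks the fixed field count.
    public static List<string> Scan(string line, int expectedFields)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inString = false;
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (!inString)
            {
                if (c == ',') { fields.Add(current.ToString()); current.Clear(); }
                else if (c == '"') { inString = true; }
                else { current.Append(c); } // unquoted text: accepted as-is here
            }
            else if (c == '"')
            {
                int next = NextNonWhitespace(line, i + 1);
                if (next == line.Length || line[next] == ',')
                {
                    inString = false; // closing quote
                    i = next;         // 'unread' the comma: handle it on the next pass
                    continue;
                }
                current.Append(c);    // stray quote: keep it as data
            }
            else
            {
                current.Append(c);    // includes commas inside the string
            }
            i++;
        }
        if (inString) throw new FormatException("Quoted field is never closed: " + line);
        fields.Add(current.ToString());
        if (fields.Count != expectedFields)
            throw new FormatException("Expected " + expectedFields + " fields, found " + fields.Count);
        return fields;
    }

    // Rebuilds a clean line: every field quoted, embedded quotes doubled.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\""));
    }

    private static int NextNonWhitespace(string line, int start)
    {
        int j = start;
        while (j < line.Length && char.IsWhiteSpace(line[j])) j++;
        return j;
    }
}
```

A stray quote that is directly followed by a comma is still indistinguishable from a real closing quote, which is why the field-count check is worth keeping. Doubling the embedded quotes in `ToCsvLine` is the usual CSV escaping convention, so the rebuilt line should be acceptable to a standard parser.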

answered 2016-08-29T20:57:57.867