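I always use the univocity parsers in my Java programs to compare CSV files. They work well and are fast.

But this time I am trying to parse two different large CSV files with complex values and write the differences to a new CSV file.

Following one of the author's examples, I tried using processFile after reading file1 into a list and converting it to a map, but I still get an error while parsing.

Below are my sample input files and the expected output file.

Input - File 1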

"h1","h2","h3","h4","h5"
"00000","US","9503.00.0089","USA","9503.0089"
"","EU","9503.00.7000","EUROPEAN UNION","9503.00.7000"
"#1200","US","5601.22.0010","USA","5601.22.0010"
"0180691","US","9503.00.0073","USA","9503.00.0073"
"DRTY01","CA","9603.01.0088","CAN","9603.01.0088"

Input - File 2

"h1","h2","h3","h6","h7","h8","h9","h10","h11"
"018890","US","","2015","101","1","1","All",""
"00000","US","9503.00.0090","1986","101","1","1","All","9503.00.0090"
"0180691","US","9503.00.0073","2019","101","1","1","All","9503.00.0073"
"DRTY01","CA","9603.01.0087","2002","102","1","2","CA","9603.01.0087"

Select the rows whose h1 and h2 values appear in both file1 and file2, then compare h3 of file1 with h3 of file2. If the two h3 values are not equal, I want to write "h1","h4","h10","h5","h11","h6","h7","h8","h9" to file 3.

Output - File 3

"h1","h4","h10","h5","h11","h6","h7","h8","h9"
"00000","USA","All","9503.00.0089","9503.00.0090","1986","101","1","1"
"DRTY01","CAN","CA","9603.01.0088","9603.01.0087","2002","102","1","2"

1 Answer

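I have a solution to your problem, but please regression-test it. I assume that h1 and h2 together uniquely identify a row. I create a HashMap whose key is an object built from h1, h2 and h3 and whose value is the entire row of the CSV file. The key class overrides hashCode and equals as follows:

  • hashCode uses only h1 and h2 to generate the code (since together they are assumed to be unique)
  • equals uses h3 as the comparison condition and returns false when the two h3 values are the same.

The logic in equals is: if h1 and h2 are the same in map1 and map2 but h3 differs, give me the rows from map1 and map2. This uses extra space for the maps, but the overall comparison drops to roughly O(N). The UnivocityTest and HeaderList classes below give you the matching rows from both maps. I have not done the IO and exception handling properly; please handle them as appropriate.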
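To make the lookup behaviour described above concrete, here is a minimal, self-contained sketch. The `Key` class is a hypothetical stand-in for the key class used in this answer (same hashing and equality rules, written with `java.util.Objects`), and the rows are placeholder strings rather than parsed CSV data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyLookupDemo {

    // Hypothetical stand-in for the answer's key class: hashes on h1 and h2 only,
    // and treats two keys as "equal" when h1 and h2 match but h3 differs.
    static final class Key {
        final String h1;
        final String h2;
        final String h3;

        Key(String h1, String h2, String h3) {
            this.h1 = h1;
            this.h2 = h2;
            this.h3 = h3;
        }

        @Override
        public int hashCode() {
            return Objects.hash(h1, h2);
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof Key)) return false;
            Key other = (Key) obj;
            return Objects.equals(h1, other.h1)
                    && Objects.equals(h2, other.h2)
                    && !Objects.equals(h3, other.h3);
        }
    }

    public static void main(String[] args) {
        Map<Key, String> file2 = new HashMap<>();
        file2.put(new Key("00000", "US", "9503.00.0090"), "row from file 2");
        file2.put(new Key("0180691", "US", "9503.00.0073"), "row from file 2");

        // Same h1/h2, different h3: found, so the row counts as a difference.
        System.out.println(file2.containsKey(new Key("00000", "US", "9503.00.0089")));
        // Same h1/h2, same h3: not found, so the row is skipped.
        System.out.println(file2.containsKey(new Key("0180691", "US", "9503.00.0073")));
        // h1 not present in file 2 at all: not found.
        System.out.println(file2.containsKey(new Key("#1200", "US", "5601.22.0010")));
    }
}
```

Note that such a class deliberately departs from the usual equals contract (two distinct keys with identical fields are not equal, and two rows in the same file that share h1 and h2 but differ in h3 would overwrite each other), so it is probably best kept private to this comparison.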

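Test class

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityTest
{

    public static void main(String[] args) throws FileNotFoundException
    {
        // Get data from csv file1
        List<String[]> f1 = getData("example.csv");
        // Get data from csv file2
        List<String[]> f2 = getData("example1.csv");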

        // Convert data to a Map with HeaderList class and entire row.
        Map<HeaderList, String[]> map1 = convertAndReturn(f1);
        Map<HeaderList, String[]> map2 = convertAndReturn(f2);

        //Currently prints the required rows.
        compareData(map1, map2);
    }

    // Convert csv to List<String[]>
    private static List<String[]> getData(String file) throws FileNotFoundException
    {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader(file));
        // String[] headers = rowProcessor.getHeaders();
        List<String[]> rows = rowProcessor.getRows();

        return rows;
    }

    // get reader object
    private static Reader getReader(String string) throws FileNotFoundException
    {
        // TODO Add proper file handling and exception handling
        return new FileReader(new File(string));
    }

    // Return HashMap
    private static Map<HeaderList, String[]> convertAndReturn(List<String[]> f1)
    {
        Map<HeaderList, String[]> map = new java.util.HashMap<>();

        for (String[] each : f1)
        {
            // For each row in csv create a corresponding HeaderList object with h1,h2 and h3 as key
            // and row as value.
            HeaderList header = new HeaderList(each[0], each[1], each[2]);
            map.put(header, each);
        }

        return map;
    }

    private static void compareData(Map<HeaderList, String[]> map1, Map<HeaderList, String[]> map2)
    {
        // Iterates over the map1 keys one by one. For each key we check if there is a matching key
        // in map2. The matching condition will be h1 and h2 should be same while h3 should be
        // different. Once a key like that is found currently I'm printing both the rows, here you
        // can get the rows you want from the map and return them.

        for (HeaderList each : map1.keySet())
        {
            if (map2.containsKey(each))
            {
                // TODO Assume you want columns h3, h4 from file1 and h6, h7 from file2.
                // map1 represents file1, with columns h3 and h4 at positions 2 and 3 inside the String[]
                // map2 represents file2, with columns h6 and h7 at positions 3 and 4 inside the String[]
                String h3FromFile1 = map1.get(each)[2];
                String h4FromFile1 = map1.get(each)[3];
                String h6FromFile2 = map2.get(each)[3];
                String h7FromFile2 = map2.get(each)[4];
                System.out.println("Required Columns: ");
                System.out.println("h3 file1: "+ h3FromFile1);
                System.out.println("h4 file1: "+ h4FromFile1);
                System.out.println("h6 file2: "+ h6FromFile2);
                System.out.println("h7 file2: " + h7FromFile2);
                System.out.println(Arrays.toString(map1.get(each)));
                System.out.println(Arrays.toString(map2.get(each)));
                System.out.println("-------------------------------");
            }
        }
    }

}
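The TODO in getReader can be addressed in several ways. One possible sketch, using only the standard library, is shown here. The class name `ReaderSketch`, the method name `openReader`, and the choice of UTF-8 are assumptions for illustration, not part of the code above; adjust the charset to match your files.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReaderSketch {

    // Hypothetical replacement for getReader: opens the file with an explicit
    // charset (UTF-8 is an assumption) and reports a missing file clearly.
    static Reader openReader(String file) {
        Path path = Paths.get(file);
        try {
            return Files.newBufferedReader(path, StandardCharsets.UTF_8);
        } catch (NoSuchFileException e) {
            throw new IllegalArgumentException("CSV file not found: " + path.toAbsolutePath(), e);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot open CSV file: " + path.toAbsolutePath(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny temporary file so the sketch runs on its own.
        Path tmp = Files.createTempFile("sample", ".csv");
        Files.write(tmp, "\"h1\",\"h2\"\n".getBytes(StandardCharsets.UTF_8));
        // try-with-resources closes the reader even if reading fails.
        try (Reader reader = openReader(tmp.toString())) {
            System.out.println((char) reader.read());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The returned Reader can be passed to a parser in the same way as the FileReader in the original method; the caller stays responsible for closing it unless the parser is known to do so.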

The bean class holding the three columns h1, h2 and h3:

class HeaderList
        {

            private String h1;

            private String h2;

            private String h3;

            public HeaderList(String h1, String h2, String h3)
            {
                super();
                this.h1 = h1;
                this.h2 = h2;
                this.h3 = h3;
            }

            /**
             * Generates the hash code from h1 and h2 only, so rows that share h1 and h2 hash alike.
             * 
             * @inheritDoc
             */
            @Override
            public int hashCode()
            {
                final int prime = 31;
                int result = 1;
                result = prime * result + ((h1 == null) ? 0 : h1.hashCode());
                result = prime * result + ((h2 == null) ? 0 : h2.hashCode());
                return result;
            }

            /**
             * The equals method assumes each csv file row is uniquely identified by h1 and h2
             * combined. If h1 and h2 do not uniquely identify a row, this may lead to data
             * loss. For h3 we return true only when the values differ.
             * 
             * @inheritDoc
             */
            @Override
            public boolean equals(Object obj)
            {
                if (this == obj)
                    return true;
                if (obj == null)
                    return false;
                if (getClass() != obj.getClass())
                    return false;
                HeaderList other = (HeaderList) obj;
                if (h1 == null)
                {
                    if (other.h1 != null)
                        return false;
                }
                else if (!h1.equals(other.h1))
                    return false;
                if (h2 == null)
                {
                    if (other.h2 != null)
                        return false;
                }
                else if (!h2.equals(other.h2))
                    return false;
                if (h3 == null)
                {
                    if (other.h3 == null)
                        return false;
                }
                else if (h3.equals(other.h3))
                    return false;
                return true;
            }

            /**
             * @inheritDoc
             */
            @Override
            public String toString()
            {
                return "HeaderList [h1=" + h1 + ", h2=" + h2 + ", h3=" + h3 + "]";
            }

        }

Output for the given input CSV files:

Required Columns: 
h3 file1: 9603.01.0088
h4 file1: CAN
h6 file2: 2002
h7 file2: 102
[DRTY01, CA, 9603.01.0088, CAN, 9603.01.0088]
[DRTY01, CA, 9603.01.0087, 2002, 102, 1, 2, CA, 9603.01.0087]
-------------------------------
Required Columns: 
h3 file1: 9503.00.0089
h4 file1: USA
h6 file2: 1986
h7 file2: 101
[00000, US, 9503.00.0089, USA, 9503.0089]
[00000, US, 9503.00.0090, 1986, 101, 1, 1, All, 9503.00.0090]
-------------------------------
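The printed rows still need to be reshaped into the file 3 layout asked for in the question ("h1","h4","h10","h5","h11","h6","h7","h8","h9"). What follows is a rough sketch of one way to do that, not part of the tested code above. It swaps in a different technique: it keys file 2 on a plain (h1, h2) list, whose equals and hashCode follow the usual contract, and compares h3 explicitly. The class and method names are hypothetical, the column positions are taken from the sample headers, and the rows are hard-coded instead of parsed so the example runs without the parser library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class File3Sketch {

    // Column positions follow the sample headers in the question:
    // file1: h1,h2,h3,h4,h5   file2: h1,h2,h3,h6,h7,h8,h9,h10,h11
    // Output order: h1,h4,h10,h5,h11,h6,h7,h8,h9
    static String[] toFile3Row(String[] f1, String[] f2) {
        return new String[] {
            f1[0], f1[3], f2[7], f1[4], f2[8], f2[3], f2[4], f2[5], f2[6]
        };
    }

    // Index file2 by (h1, h2), then walk file1 once and keep rows whose h3 differs.
    static List<String[]> diff(List<String[]> file1, List<String[]> file2) {
        Map<List<String>, String[]> byKey = new HashMap<>();
        for (String[] row : file2) {
            byKey.put(Arrays.asList(row[0], row[1]), row);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] row : file1) {
            String[] match = byKey.get(Arrays.asList(row[0], row[1]));
            if (match != null && !Objects.equals(row[2], match[2])) {
                out.add(toFile3Row(row, match));
            }
        }
        return out;
    }

    // Quote every field and double any embedded quote characters.
    static String toCsvLine(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            String value = fields[i] == null ? "" : fields[i];
            sb.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> file1 = Arrays.asList(
            new String[] {"00000", "US", "9503.00.0089", "USA", "9503.0089"},
            new String[] {"0180691", "US", "9503.00.0073", "USA", "9503.00.0073"},
            new String[] {"DRTY01", "CA", "9603.01.0088", "CAN", "9603.01.0088"});
        List<String[]> file2 = Arrays.asList(
            new String[] {"00000", "US", "9503.00.0090", "1986", "101", "1", "1", "All", "9503.00.0090"},
            new String[] {"0180691", "US", "9503.00.0073", "2019", "101", "1", "1", "All", "9503.00.0073"},
            new String[] {"DRTY01", "CA", "9603.01.0087", "2002", "102", "1", "2", "CA", "9603.01.0087"});

        System.out.println("\"h1\",\"h4\",\"h10\",\"h5\",\"h11\",\"h6\",\"h7\",\"h8\",\"h9\"");
        for (String[] row : diff(file1, file2)) {
            System.out.println(toCsvLine(row));
        }
    }
}
```

With the sample rows this yields one line for 00000 and one for DRTY01, in file 1 order; the h5 value for 00000 is emitted exactly as it appears in the file 1 sample row. For real files the lines would go to a Writer rather than System.out.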
Answered 2017-11-26T18:41:06.357