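I am using Univocity parsers version 2.7.3. I have a CSV file with 1 million records, and it may grow in the future. I only read a few specific columns from the file, and these are my requirements: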

  • Never hold the CSV content in memory at any point

  • Skip bean creation if the latitude or longitude column in the CSV is null/blank

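To meet these requirements, I tried using CsvRoutines so that the CSV data is not copied into memory. I put the @Validate annotation on both the latitude and longitude fields, and registered an error handler that throws no exception, so that records failing validation are skipped.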

Sample CSV:

#version:1.0
#timestamp:2017-05-29T23:22:22.320Z
#brand:test report    
    network_name,location_name,location_category,location_address,location_zipcode,location_phone_number,location_latitude,location_longitude,location_city,location_state_name,location_state_abbreviation,location_country,location_country_code,pricing_type,wep_key
    "1 Free WiFi","Test Restaurant","Cafe / Restaurant","Marktplatz 18","1233","+41 263 34 05","1212.15","7.51","Basel","test","BE","India","DE","premium",""
    "2 Free WiFi","Test Restaurant","Cafe / Restaurant","Zufikerstrasse 1","1111","+41 631 60 00","11.354","8.12","Bremgarten","test","AG","China","CH","premium",""
    "3 Free WiFi","Test Restaurant","Cafe / Restaurant","Chemin de la Fontaine 10","1260","+41 22 361 69","12.34","11.23","Nyon","Vaud","VD","Switzerland","CH","premium",""
    "!.oist*~","HoistGroup Office","Office","Chemin de I Etang","CH-1211","","","","test","test","GE","Switzerland","CH","premium",""
    "test","tess's Takashiro","Cafe / Restaurant","Test 1-10","870-01","097-55-1808","","","Oita","Oita","OITA","Japan","JP","premium","1234B"

TestDTO.java

@Data
@NoArgsConstructor
@AllArgsConstructor
@JsonIgnoreProperties(ignoreUnknown = true)
public class TestDTO implements Serializable {

    @Parsed(field = "location_name")
    private String name;
    @Parsed(field = "location_address")
    private String addressLine1;
    @Parsed(field = "location_city")
    private String city;
    @Parsed(field = "location_state_abbreviation")
    private String state;
    @Parsed(field = "location_country_code")
    private String country;
    @Parsed(field = "location_zipcode")
    private String postalCode;

    @Parsed(field = "location_latitude")
    @Validate
    private Double latitude;

    @Parsed(field = "location_longitude")
    @Validate
    private Double longitude;

    @Parsed(field = "network_name")
    private String ssid;
}

Main.java

 CsvParserSettings parserSettings = new CsvParserSettings();        
        parserSettings.detectFormatAutomatically();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        parserSettings.setHeaderExtractionEnabled(true);
        parserSettings.setSkipEmptyLines(true);
        parserSettings.selectFields("network_name", "location_name","location_address", "location_zipcode",
                "location_latitude", "location_longitude", "location_city","location_state_abbreviation", "location_country_code");

        parserSettings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
            @Override
            public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
                //do nothing
            }
        });


        CsvRoutines parser = new CsvRoutines(parserSettings);
        ResultIterator<TestDTO, ParsingContext> iterator = parser.iterate(TestDTO.class, new FileReader("c:\\users\\...\\test.csv")).iterator();


        int i=0;
        while(iterator.hasNext()) {
            TestDTO dto = iterator.next();
            if(dto.getLongitude() == null || dto.getLatitude() == null)
                i++;            
        }

        System.out.println("count=="+i);

Problem:

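I expected the count to be zero, since I added the error handler and no data validation exception is thrown, but that does not seem to be the case. I assumed @Validate would throw an exception when it encounters a record with a null latitude or longitude (both columns can be null in the same record), and that the error handler would then handle it so the record is skipped.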

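Basically, I don't want Univocity to create and map unnecessary DTO objects on the heap (and run out of memory), since an incoming CSV file can have more than 200k or 300k records with a null latitude/longitude.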

I even tried adding a custom validator to @Validate, but to no avail.

Can someone tell me what I am missing here?

1 Answer

Author of the library here. You are doing everything correctly. This is a bug; I have just opened an issue for it and will fix it today.

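The bug appears when you select fields: reordering the values makes the validation run against something else (in my test, it validated the city instead of the latitude).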

In your case, adding the following line of code makes it work:

parserSettings.setColumnReorderingEnabled(false);

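This produces rows with nulls in the positions of the fields that were not selected, instead of removing the nulls and reordering the values in the parsed row. It avoids the bug and also makes your program run slightly faster.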
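To picture why the setting matters, here is a plain-Java sketch of the two row layouts. It does not use Univocity and is not the library's actual code; the class and method names (ColumnReorderingSketch, reordered, keptInPlace) are made up for illustration, and it covers only the first nine columns of the sample CSV. It assumes the validation was looking up latitude at its original column position.

```java
public class ColumnReorderingSketch {

    // Reordering enabled (illustration): keep only the selected values, packed in selection order.
    public static String[] reordered(String[] fullRow, int[] selected) {
        String[] out = new String[selected.length];
        for (int i = 0; i < selected.length; i++) {
            out[i] = fullRow[selected[i]];
        }
        return out;
    }

    // Reordering disabled (illustration): keep the original width, unselected positions stay null.
    public static String[] keptInPlace(String[] fullRow, int[] selected) {
        String[] out = new String[fullRow.length];
        for (int index : selected) {
            out[index] = fullRow[index];
        }
        return out;
    }

    public static void main(String[] args) {
        // First nine columns of the first sample record, in file order.
        String[] row = {"1 Free WiFi", "Test Restaurant", "Cafe / Restaurant", "Marktplatz 18",
                "1233", "+41 263 34 05", "1212.15", "7.51", "Basel"};
        // File positions of network_name, location_name, location_address, location_zipcode,
        // location_latitude, location_longitude, location_city.
        int[] selected = {0, 1, 3, 4, 6, 7, 8};
        int latitudeIndex = 6; // position of location_latitude in the file

        System.out.println(reordered(row, selected)[latitudeIndex]);   // prints "Basel"
        System.out.println(keptInPlace(row, selected)[latitudeIndex]); // prints "1212.15"
    }
}
```

With the values packed together, position 6 holds the city, which is consistent with the observation above that the city was validated instead of the latitude.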

You also need to test for null in the iteration bit:

TestDTO dto = iterator.next();
if (dto != null) { // dto may come null here due to validation
    if (dto.getLongitude() == null || dto.getLatitude() == null) {
        i++;
    }
}
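The null-check pattern can be tried without the parser. The sketch below is plain Java and makes some assumptions: Location is a hypothetical stand-in for TestDTO with only the two validated fields, and the iterator is an ordinary java.util.Iterator that yields null where a record was rejected. Each element is inspected and dropped, so nothing accumulates on the heap.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NullAwareIterationSketch {

    // Hypothetical stand-in for TestDTO, reduced to the two validated fields.
    public static final class Location {
        final Double latitude;
        final Double longitude;

        public Location(Double latitude, Double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    // Counts usable records one at a time; no record is retained after it is inspected.
    public static int countValid(Iterator<Location> iterator) {
        int valid = 0;
        while (iterator.hasNext()) {
            Location dto = iterator.next();
            if (dto == null) {
                continue; // assumed to mean "rejected by validation"
            }
            if (dto.latitude != null && dto.longitude != null) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Two usable records and two rejected ones, mirroring the tail of the sample CSV.
        List<Location> rows = Arrays.asList(
                new Location(11.354, 8.12), null, new Location(12.34, 11.23), null);
        System.out.println(countValid(rows.iterator())); // prints 2
    }
}
```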

Hope this helps, and thank you for using our parsers!

answered 2018-12-17T06:31:09.013