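I am doing some analysis on real-time Twitter stream data. I have heard of Spark Streaming. I would like to know which approach is best suited to real-time stream analytics, since my data arrives from the source very quickly.
2 Answers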
3
I recommend this presentation on the topic: http://fr.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
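In fact, Apache Storm is a true streaming architecture: events are handled one at a time, and if you want to group them you have to design a topology for that. It is the most powerful option in terms of latency and design. But it is of course complex, and you have to design exactly what you want.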
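On the other hand, Apache Spark is a micro-batch architecture. It resembles Hadoop, but it runs every x seconds, producing micro-batches of data over a defined time window. Because it really looks like a batch solution, it seems simpler, and it is sufficient if you do not need latency below a few seconds.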
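To make the difference concrete, here is a minimal sketch in plain Python. It uses neither Storm nor Spark; the function names `per_event` and `micro_batches`, the `(timestamp, text)` event shape, and the sample data are all assumptions made for illustration. The first function handles each tweet as it arrives, the second groups tweets into consecutive fixed-length intervals, which is roughly the idea behind micro-batching.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed event shape for this sketch: (timestamp in seconds, tweet text).
Event = Tuple[float, str]


def per_event(events: Iterable[Event]) -> Iterator[int]:
    """Event-at-a-time style: emit one result (a word count) per tweet."""
    for _, text in events:
        yield len(text.split())


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Micro-batch style: group events into consecutive intervals of `interval` seconds.

    Intervals start at the first event's timestamp. Intervals that contain
    no events are emitted as empty lists.
    """
    batch: List[Event] = []
    batch_end = None
    for ts, text in events:
        if batch_end is None:
            batch_end = ts + interval
        # Close every interval that ended before this event arrived.
        while ts >= batch_end:
            yield batch
            batch = []
            batch_end += interval
        batch.append((ts, text))
    if batch:
        yield batch


# Hypothetical sample data, ordered by timestamp.
events = [(0.0, "a b"), (0.5, "c"), (2.1, "d e f"), (2.9, "g")]
word_counts = list(per_event(events))
batch_sizes = [len(b) for b in micro_batches(events, 1.0)]
```

With per-event processing a result is available as soon as each tweet is seen; with micro-batches nothing is emitted until an interval closes, which is where the extra latency of a few seconds comes from.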
Answered 2014-11-22T18:03:36.397
3
Apart from the really nice presentation linked by @zenbeni, I would like to add a few specific points based on first-hand experience with both Storm and Spark Streaming, especially regarding your use case (Twitter data).
- Twitter itself uses Storm for many parts of its real-time stream processing pipeline. So if the type of processing you want to do is similar, Storm is a good choice.
- Storm's multi-language support is great, but it is hard to pass errors around. For example, if you are calling Python code from a Java bolt and an exception is raised in the Python code, it is not easy to propagate that exception back to the Java code.
- If your analysis is based on a single tweet only, Storm will likely be better. However, if you need to do aggregate or iterative analytics, you will have to micro-batch in Storm as well. This essentially means you have to store state in a bunch of your bolts.
- Finally, one often needs to do both stream and batch processing. Spark shines when you need to mix stream processing with batch, interactive, and iterative processing. In fact, it is not clear to me how you should do iterative processing in Storm.
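As a rough illustration of what "storing state in your bolts" means for aggregate analytics, here is a small plain-Python sketch. `HashtagCounter` is a hypothetical stand-in for a stateful bolt and is not Storm's actual API; the method names `process` and `flush` and the sample tweets are likewise assumptions.

```python
import re
from collections import Counter
from typing import Dict

HASHTAG = re.compile(r"#(\w+)")


class HashtagCounter:
    """Hypothetical stand-in for a stateful bolt (not Storm's real API)."""

    def __init__(self) -> None:
        # In-memory state that must survive between individual tweets.
        self.counts: Counter = Counter()

    def process(self, tweet: str) -> None:
        # Per-tweet work: extract hashtags and fold them into the state.
        self.counts.update(tag.lower() for tag in HASHTAG.findall(tweet))

    def flush(self) -> Dict[str, int]:
        # Emit the aggregate for the current window, then reset the state.
        snapshot = dict(self.counts)
        self.counts.clear()
        return snapshot


counter = HashtagCounter()
for tweet in ["Loving #Spark and #storm", "#spark streaming talk", "no tags here"]:
    counter.process(tweet)
window_counts = counter.flush()
```

Calling `flush` on a timer is effectively hand-rolled micro-batching: the component accumulates over a window and emits once per window, and in this sketch any state held only in memory would be lost if the process failed.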
Answered 2014-11-22T21:02:38.030