Data Ingestion
Contents
What factors need to be considered when ingesting data? There is no silver bullet.
- What data types are involved? Is the data format compatible with the query and processing tools already in use?
- What are the expected file sizes? Are they evenly distributed, or will there be a large number of small files?
- Will the schema change frequently?
- Do queries tend to touch only a handful of columns, or all of them?
- How fast must ingestion be, and how much delay can downstream consumers (e.g., queries) tolerate?
- Is the cluster self-hosted, or run on another vendor's platform?
…
Data Ingestion Considerations
Timeliness of data ingestion and accessibility
At what frequency is data ingested, and how soon must it be available to downstream consumers?
Classification | Timeliness requirement | Tools | Storage layer |
---|---|---|---|
Macro batch | > 15 minutes | Sqoop, file transfers | HDFS |
Microbatch | > 2 minutes | Sqoop, file transfers | HDFS |
Near-Real-Time Decision Support | > 2 seconds | Flume, Kafka | HBase, Elasticsearch |
Near-Real-Time Event Processing | > 100 milliseconds | Flume, Kafka | HBase, Elasticsearch |
Real Time | < 100 milliseconds | custom | custom |
Storage format
The main storage formats are plain text, Hadoop's native SequenceFile, serialization with Avro, and the columnar formats Parquet and ORC (Optimized RCFile).
Format | Size | Read speed | Write speed | Features |
---|---|---|---|---|
Text | +++++ | + | +++++ | Convenient, human-readable format; does not support block compression; bulky to store and not efficient to query |
Sequence File | +++ | ++ | +++ | Row-oriented format; splittable even when the data is compressed; can be used to pack many small files in Hadoop |
Avro | ++ | +++ | ++ | Row-oriented binary format; best schema evolution; self-describing, language-independent schema (JSON) stored in the file alongside the data; supports block compression and is splittable; compact and fast, widely used as a serialization platform |
Parquet | + | +++++ | + | Column-oriented binary format; slow to write but fast to read; optimized and efficient in terms of disk I/O when only specific columns need to be queried; supports compression; limited schema evolution (new columns can be added at the end of the structure) |
ORC | + | +++++ | + | Like Parquet, but designed specifically for Hive rather than as a general-purpose storage format; does not support schema evolution |
If things change frequently, such as the schema or the set of columns a query returns, Avro is the better fit. If the query result columns are fixed and write speed is not a concern, Parquet is the better choice. If you just want to land data on HDFS as fast as possible, plain text is undoubtedly the quickest, but in practice almost nobody does that: a format like XML is not splittable, which means it cannot be processed in parallel, and it also burns a lot of disk and network I/O, so it usually has to be compressed, yet it does not support block compression (once compressed it is no longer splittable). ORC was developed by Hortonworks, and vendor competition meant Cloudera Impala could not use it, so Cloudera and Twitter jointly created Parquet. SequenceFiles are similar to CSV but support block compression, and are commonly used as intermediate storage for MapReduce jobs.
The choice of format is not that absolute; picking one does not rule out the others, and they are usually combined in practice: for example, SequenceFiles during the processing stage, Parquet for the query stage, and perhaps CSV for extracts (easy to load into a database).
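To make the contrast concrete, here is a minimal Python sketch, assuming the third-party pyarrow and fastavro packages; the file names, columns, and `Event` schema are invented for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import writer

# Parquet: column-oriented, so reads can project only the columns they need.
table = pa.table({"user_id": [1, 2, 3], "event": ["click", "view", "click"]})
pq.write_table(table, "events.parquet", compression="snappy")
events_only = pq.read_table("events.parquet", columns=["event"])  # column pruning

# Avro: row-oriented and self-describing -- the JSON schema travels with the data.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event", "type": "string"},
    ],
}
with open("events.avro", "wb") as f:
    writer(f, schema, [{"user_id": 1, "event": "click"}], codec="deflate")
```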
Codecs
How should a compression codec be chosen?
Codec | Compressed size | Compression speed | Decompression speed | Splittable |
---|---|---|---|---|
Snappy | +++++ | ++++ | +++++ | Not inherently splittable; needs a container format such as SequenceFile or Avro |
LZO | ++++ | ++++ | ++++ | Splittable, but requires an additional indexing step |
Gzip | ++ | ++ | +++ | Not inherently splittable; needs a container format such as SequenceFile or Avro; using smaller blocks can lead to better performance |
bzip2 | + | + | ++ | Inherently splittable |
As the table shows, Snappy is good enough for hot data in most cases; for plain-text data LZO is a bit more convenient; Gzip is recommended for cold data; and bzip2 when you need an even higher compression ratio.
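If you want to check these trade-offs on your own data, a rough benchmark is easy to put together. The sketch below uses the standard-library gzip and bz2 modules and, if it happens to be installed, the third-party python-snappy package; the sample payload is made up:

```python
import bz2
import gzip
import time

# Hypothetical repetitive log payload; in practice measure on your own data.
payload = b"2024-01-01T00:00:00 INFO user=42 action=click page=/home\n" * 200_000

def bench(name, compress, decompress):
    t0 = time.perf_counter()
    blob = compress(payload)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    ratio = len(payload) / len(blob)
    print(f"{name:8s} ratio={ratio:5.1f}x compress={t1 - t0:.2f}s decompress={t2 - t1:.2f}s")

bench("gzip", gzip.compress, gzip.decompress)
bench("bzip2", bz2.compress, bz2.decompress)

try:
    import snappy  # third-party python-snappy package (assumed installed)
    bench("snappy", snappy.compress, snappy.decompress)
except ImportError:
    pass
```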
Incremental updates
In what form does new data land: append-only, or does existing data need to be modified?
If it is append-only, you can simply push it into HDFS. If a merge is needed, HDFS requires a compaction job that merges the incremental data into the existing data, for example with a Hive left join; a PySpark sketch of the same merge logic follows below.
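A minimal compaction sketch in PySpark, equivalent in spirit to the Hive left-join approach; the HDFS paths and the key column `id` are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

base = spark.read.parquet("hdfs:///warehouse/events/base")    # existing data
delta = spark.read.parquet("hdfs:///warehouse/events/delta")  # incremental data

# Keep base rows whose key does NOT appear in the delta, then add the delta:
# the delta therefore wins for every key it contains (insert or update).
compacted = base.join(delta, on="id", how="left_anti").unionByName(delta)

# Write the merged result to a new location, then swap it in for the old base.
compacted.write.mode("overwrite").parquet("hdfs:///warehouse/events/base_new")
```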
Data access pattern
Batch processing jobs, or random access?
For large sequential scans and batch processing, HDFS is the better fit; for random access to individual records, a NoSQL store such as HBase is the better fit.
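For the random-access case, here is a minimal sketch using the third-party happybase client for HBase, assuming an HBase Thrift server on localhost and an existing `user_profiles` table with a column family `info` (all names are made up):

```python
import happybase

conn = happybase.Connection("localhost")
table = conn.table("user_profiles")

# Point write: upsert a single row by key.
table.put(b"user#42", {b"info:name": b"alice", b"info:country": b"CN"})

# Point read: fetch one row by key -- the access pattern HBase is built for,
# and the one a full HDFS scan would handle poorly.
row = table.row(b"user#42", columns=[b"info:name"])
print(row)
```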
Source system and data structure
What kind of source system: an RDBMS or logs? What kind of data: structured, semi-structured, or unstructured?
- hdfs-file-slurper: a lightweight transfer tool with concurrency, multiple instances per machine, simple fault handling, LZO compression, custom scripts, and so on
- Sqoop: the tool of choice for bulk transfers between Hadoop and an RDBMS
- Flume: a strong choice for log collection
- Kafka and Logstash: together with Elasticsearch, the golden ELK combination for log collection!
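As a small illustration of the Kafka route, here is a sketch using the third-party kafka-python client; the broker address, topic name, and log path are assumptions, and Logstash/Elasticsearch would consume from the topic downstream:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Ship each log line as one Kafka record on the 'app-logs' topic.
with open("/var/log/app.log", "rb") as f:
    for line in f:
        producer.send("app-logs", value=line.rstrip())

producer.flush()  # block until all buffered records are acknowledged
producer.close()
```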
Network Bottlenecks
Set up good monitoring, watch for NIC bottlenecks, and compress data in transit.
Network Security
Data masking, encryption, and decryption. If you use Flume, data transferred between agents can be encrypted; with Kafka, additional handling is required.
Failure Handling
Fault tolerance (a small sketch follows this list):
- When transferring 10 files, if one fails, can you tell which one it was, so that a retry only needs to resend that failed file?
- When transferring a single 10 GB file with `hdfs dfs -put`, if the transfer breaks after 9 GB, you have to start over from the beginning.
- Sending data through a queue may deliver a record more than once, so the receiving side needs to deduplicate.
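A minimal sketch of the first and third points, assuming a hypothetical `upload()` helper built on `hdfs dfs -put` and a unique `id` field per record:

```python
import subprocess
from pathlib import Path

def upload(path: Path) -> bool:
    """Push one file to HDFS; the destination directory is an assumption."""
    result = subprocess.run(
        ["hdfs", "dfs", "-put", "-f", str(path), "/landing/"],
        capture_output=True,
    )
    return result.returncode == 0

# Track success per file, so a retry only resends the failures.
files = sorted(Path("/data/outgoing").glob("*.csv"))
failed = [p for p in files if not upload(p)]
for p in failed:
    upload(p)

# Receiver-side dedup: keep only the first occurrence of each record key.
seen = set()
def accept(record: dict) -> bool:
    key = record["id"]      # hypothetical unique record key
    if key in seen:
        return False        # duplicate delivered by the queue; drop it
    seen.add(key)
    return True
```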
Level of Complexity
Do not over-complicate the problem. If all you want is to upload a few files, there is no need to stand up a Flume or Kafka cluster; a plain `hadoop fs -put` will do.
Good design is not big and all-encompassing, but simple and elegant: instead of asking what else you could add, ask what you could cut.