What factors need to be considered when ingesting data? No Silver Bullet.

  • What data types are involved? Is the data format compatible with the query and processing tools already in use?
  • What is the estimated file size, and is it evenly distributed? Will there be a large number of small files?
  • Will the schema change frequently?
  • Do queries tend to touch only a few columns, or all of them?
  • What ingestion speed is required, and how much delay can downstream consumers (e.g., queries) tolerate?
  • Is the cluster platform self-hosted, or provided by a vendor?

Data Ingestion Considerations

Timeliness of data ingestion and accessibility

At what frequency is data ingested, and how soon must it be available to downstream consumers?

| Classification | Timeliness requirement | Tools | Storage layer |
|---|---|---|---|
| Macro batch | > 15 minutes | Sqoop, file transfers | HDFS |
| Microbatch | > 2 minutes | Sqoop, file transfers | HDFS |
| Near-Real-Time Decision Support | > 2 seconds | Flume, Kafka | HBase, Elasticsearch |
| Near-Real-Time Event Processing | > 100 milliseconds | Flume, Kafka | HBase, Elasticsearch |
| Real Time | < 100 milliseconds | custom | custom |

Storage format

The main storage formats are plain text, Hadoop's native Sequence File, the Avro serialization format, and the columnar formats Parquet and ORC (Optimized Row Columnar, which grew out of RCFile).

| Format | Size | Read | Write | Features |
|---|---|---|---|---|
| Text | +++++ | + | +++++ | Convenient, human-readable format; does not support block compression; bulky to store and not efficient to query |
| Sequence File | +++ | ++ | +++ | Row-oriented format; splittable even when the data is compressed; can be used to pack small files in Hadoop |
| Avro | ++ | +++ | ++ | Row-oriented binary format; better schema evolution; self-describing, language-independent schema (JSON) stored in the file alongside the data; supports block compression and is splittable; compact, fast binary format widely used as a serialization platform |
| Parquet | + | +++++ | + | Column-oriented binary format; slow to write but fast to read; optimized and efficient in disk I/O when only specific columns are queried; supports compression; limited schema evolution (new columns can be added at the end of the structure) |
| ORC | + | +++++ | + | Like Parquet, but designed specifically for Hive rather than as a general-purpose storage format; does not support schema evolution |

(More + means a larger file on disk, or a faster read/write.)

When things change frequently, such as the schema or the set of fields queries return, Avro is the better fit. If the query field set is stable and write speed is not a concern, Parquet is the better choice. If the goal is simply to land data in HDFS as fast as possible, plain text is undoubtedly the quickest, but in practice almost nobody does this: a format like XML is not splittable, which means it cannot be processed in parallel, and it is heavy on disk and network I/O, so it usually has to be compressed, yet it does not support block compression (once compressed, it is no longer splittable). ORC was developed by Hortonworks, and vendor competition meant Cloudera Impala could not use it, so Cloudera and Twitter built Parquet together. SequenceFiles are similar to CSV but support block compression, and are typically used as intermediate storage between MapReduce jobs.
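To make the Avro points concrete, here is a minimal sketch using the Avro Java API: it writes GenericRecords into a container file that embeds the JSON schema and uses a block-compressed codec. The User schema, field names, and output path are made up for the example.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    // Hypothetical schema: optional fields can be added later with defaults,
    // and old readers keep working -- this is Avro's schema evolution story.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "alice");

        // The container file is self-describing: it stores the JSON schema
        // alongside the data, and block compression keeps it splittable.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```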

The choice of format is not absolute, as if picking one ruled out the others; in practice they are combined: for example, SequenceFiles for the processing stage, Parquet for the query stage, and perhaps CSV for extracts (easy to load into a database).
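For the query stage, the same Avro records can be rewritten into a columnar copy. Below is a minimal sketch using parquet-avro's AvroParquetWriter; the schema and output path are again hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "alice");

        // Columnar layout: queries that touch only a few columns read far less data.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(user);
        }
    }
}
```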

Codecs

How should a compression codec be chosen?

| Codec | Size | Compression speed | Decompression speed | Splittable |
|---|---|---|---|---|
| Snappy | +++++ | ++++ | ++++++ | Not inherently splittable; needs a container format such as SequenceFile or Avro |
| LZO | ++++ | ++++ | +++++ | Splittable, but requires an additional indexing step |
| Gzip | ++ | ++ | +++ | Not inherently splittable; needs a container format such as SequenceFile or Avro; using smaller blocks can lead to better performance |
| bzip2 | + | + | ++ | Inherently splittable |

As the table shows, Snappy is sufficient for hot data in most cases, while LZO is a bit more convenient for plain text; Gzip is recommended for cold data, and bzip2 when an even higher compression ratio is needed.
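Here is a minimal sketch of the "Snappy inside a container format" pattern from the table, writing a block-compressed SequenceFile; the HDFS path, key/value types, and payloads are assumptions for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappySequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Block compression keeps the file splittable even though Snappy
        // on its own is not inherently splittable.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/raw/events.seq")),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            writer.append(new LongWritable(1L), new Text("first event payload"));
            writer.append(new LongWritable(2L), new Text("second event payload"));
        }
    }
}
```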

Incremental updates

How does new data land: append only, or does it need to modify existing data?
If it is append only, just write it straight into HDFS. If a merge is needed, HDFS requires a compaction job that merges the incremental data with the existing data, for example via a Hive left join.
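The sketch below illustrates only the merge semantics such a compaction job implements (last write wins per primary key); a real job would run in Hive or another engine over HDFS, and the file names and the "key,rest-of-record" line layout are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class CompactionSketch {
    // Each line is "primaryKey,restOfRecord"; incremental rows override base rows.
    static Map<String, String> load(Path file) throws IOException {
        Map<String, String> rows = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file)) {
            int comma = line.indexOf(',');
            rows.put(line.substring(0, comma), line);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> merged = load(Paths.get("base/part-00000.csv"));
        // Rows with the same key are replaced; new keys are appended.
        merged.putAll(load(Paths.get("incremental/part-00000.csv")));
        // Output directory is assumed to exist.
        Files.write(Paths.get("compacted/part-00000.csv"), merged.values());
    }
}
```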

Data access pattern

Batch processing jobs, or random access?
For large scan-heavy batch processing, HDFS is the better fit; for random access to individual records, a NoSQL store such as HBase is more appropriate.
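For the random-access case, here is a minimal HBase point-lookup sketch; the table name, column family, and row key are made up for the example.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {
            // Point lookup by row key -- the access pattern HDFS alone does not serve well.
            Result result = table.get(new Get(Bytes.toBytes("user#42")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(email == null
                    ? "no such row/column"
                    : new String(email, StandardCharsets.UTF_8));
        }
    }
}
```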

Source system and data structure

Source system type: RDBMS or logs? Data structure: structured, semi-structured, or unstructured?

  • hdfs-file-slurper
    A lightweight transfer tool: concurrent transfers, multiple instances per machine, simple failure handling, LZO compression, custom scripts, and so on
  • Sqoop
    The workhorse for bulk transfers between Hadoop and an RDBMS
  • Flume
    A solid choice for log collection
  • Kafka + Logstash
    Add Elasticsearch and you have the classic ELK combination for log collection! (a minimal producer sketch follows this list)
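Below is a minimal sketch of shipping log lines into Kafka with the Java producer; the broker address, topic, and log line are assumptions, and enabling producer-side compression also helps with the network concerns discussed below.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "snappy");  // compress batches on the wire
        props.put("acks", "all");                 // wait for full acknowledgement

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by host so lines from one machine stay ordered within a partition.
            producer.send(new ProducerRecord<>("app-logs", "web-01",
                    "2024-01-01T00:00:00Z INFO request served in 12ms"));
        }
    }
}
```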

Network Bottlenecks

Set up good monitoring, watch for NIC bottlenecks, and compress data in transit.

Network Security

Data masking, encryption, and decryption. If you use Flume, data transferred between agents can be encrypted; with Kafka, extra handling is required.
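As a hedged sketch of that extra handling, recent Kafka clients can be pointed at a TLS listener using the standard SSL client settings; the listener port, keystore/truststore paths, and passwords below are placeholders.

```java
import java.util.Properties;

public class KafkaSslConfigSketch {
    // Standard Kafka client SSL settings; paths and passwords are placeholders.
    static Properties sslClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");   // assumed SSL listener
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        // Only needed when the broker requires client authentication:
        props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");
        props.put("ssl.keystore.password", "changeit");
        return props;
    }

    public static void main(String[] args) {
        sslClientProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```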

Failure Handling

Fault-tolerance mechanisms:

  • When transferring 10 files, if one of them fails, can you tell which one, so that only the failed file needs to be resent?
  • When transferring a 10 GB file with dfs put, if the transfer breaks after 9 GB you have to start over from the beginning.
  • Going through a queue may produce duplicate records, so the receiving end needs to deduplicate (see the sketch after this list).
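A minimal sketch of receiver-side deduplication, assuming every record carries a unique ID; the record layout and the in-memory seen-set are simplifications (a real consumer would persist the set or rely on idempotent writes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    public static void main(String[] args) {
        // Records as "recordId|payload"; the second "evt-1" simulates a redelivery.
        List<String> received = Arrays.asList(
                "evt-1|order created",
                "evt-2|order paid",
                "evt-1|order created");

        Set<String> seenIds = new HashSet<>();
        List<String> accepted = new ArrayList<>();
        for (String record : received) {
            String id = record.substring(0, record.indexOf('|'));
            // Drop anything whose ID has already been processed.
            if (seenIds.add(id)) {
                accepted.add(record);
            }
        }
        System.out.println(accepted);   // [evt-1|order created, evt-2|order paid]
    }
}
```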

Level of Complexity

Complexity: do not over-complicate the problem. If all you want is to upload a few files, there is no need to stand up a Flume or Kafka cluster; a plain fs put is enough.

Good design is not about being big and complete, but about being simple and elegant; instead of asking what else can be added, ask what can be cut.

Reference