HDFS API Operations

Author: 做个合格的大厂程序员 | Published 2020-06-13 21:45


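*Configuring the Hadoop environment on Windows

In my testing, macOS does not appear to have this problem.

On Windows you need to set up a Hadoop runtime environment first; otherwise, running the code directly produces problems such as the following: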

Could not locate executable null \bin\winutils.exe in the hadoop binaries

hadoop.dll is missing

Unable to load native-hadoop library for your platform… using builtin-Java classes where applicable

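Steps:

Copy the hadoop2.7.5 folder to a path that contains no Chinese characters and no spaces.
Set the Hadoop environment variable HADOOP_HOME on Windows, and add %HADOOP_HOME%\bin to the PATH.
Copy hadoop.dll from the bin directory of the hadoop2.7.5 folder into C:\Windows\System32 on the system drive.
Restart Windows.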
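Before running any HDFS code it can help to verify that the steps above took effect. The following is a minimal sketch using only the JDK; the class name HadoopEnvCheck and its method are hypothetical helpers, not part of Hadoop, and the two file names checked are the ones mentioned in the errors and steps above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HadoopEnvCheck {

    /** Returns the names of the expected files that are missing under hadoopHome/bin. */
    static List<String> missingFiles(Path hadoopHome) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"winutils.exe", "hadoop.dll"}) {
            if (!Files.isRegularFile(hadoopHome.resolve("bin").resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            System.out.println("HADOOP_HOME is not set");
            return;
        }
        System.out.println("HADOOP_HOME = " + home);
        System.out.println("Missing under bin: " + missingFiles(Paths.get(home)));
    }
}
```

An empty "Missing under bin" list suggests the folder was copied correctly; it does not check the copy of hadoop.dll in System32.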

Adding the Maven dependencies

Open the project in IntelliJ IDEA and add the following Maven dependencies to pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                    <!--    <verbose>true</verbose>-->
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <minimizeJar>true</minimizeJar>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>

Wait for Maven to finish downloading the dependencies.

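Accessing data via URL (for reference)

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.junit.Test;

@Test
public void demo1() throws Exception {
    // Step 1: register the handler for hdfs:// URLs
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    // Open the HDFS input stream and the local output stream; both are closed automatically
    try (InputStream inputStream = new URL("hdfs://node1:8020/dir1/123.txt").openStream();
         OutputStream outputStream = new FileOutputStream("/Users/caoxiaozhu/Desktop/hello.txt")) {
        // Copy the file contents
        IOUtils.copy(inputStream, outputStream);
    }
}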
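One practical caveat with the URL approach is that the JDK accepts a stream handler factory only once per JVM, so a second call to URL.setURLStreamHandlerFactory fails. The sketch below shows this with the JDK alone; the class UrlFactoryOnce is a hypothetical name, and the placeholder factory that returns null stands in for FsUrlStreamHandlerFactory so the example runs without Hadoop on the classpath.

```java
import java.net.URL;

public class UrlFactoryOnce {

    /** Tries to install a stream handler factory; returns false if one is already set. */
    static boolean tryInstall() {
        try {
            // Returning null makes the JDK fall back to its built-in handlers.
            // In the HDFS case the argument would be new FsUrlStreamHandlerFactory().
            URL.setURLStreamHandlerFactory(protocol -> null);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK throws java.lang.Error when a factory was installed earlier
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("first call installed:  " + tryInstall());
        System.out.println("second call installed: " + tryInstall());
    }
}
```

In a fresh JVM the first call is expected to succeed and the second to be rejected, which is one reason the FileSystem approach is usually preferred when several tests share a JVM.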

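Accessing data via the FileSystem API (must know)

Operating on HDFS from Java mainly involves the following classes:

Configuration
- An object of this class encapsulates the configuration of the client or server.
FileSystem
- An object of this class represents a file system, and its methods are used to operate on files. It is obtained through the static method get of FileSystem:

FileSystem fs = FileSystem.get(conf);

The get method decides which type of file system to return from the value of the fs.defaultFS parameter in conf.
If the code does not set fs.defaultFS and the project classpath does not provide a configuration either, the default value in conf comes from core-default.xml inside the Hadoop jar.
That default is file:///, so the object returned is not a DistributedFileSystem instance but a client for the local file system.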
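To make the role of fs.defaultFS concrete, the following sketch uses only java.net.URI to extract the scheme that the decision is based on. It is an illustration, not Hadoop's actual lookup code: the class DefaultFsScheme and the textual mapping in describe are hypothetical.

```java
import java.net.URI;

public class DefaultFsScheme {

    /** Returns the URI scheme of an fs.defaultFS value, or null if it has none. */
    static String schemeOf(String defaultFs) {
        return URI.create(defaultFs).getScheme();
    }

    /** Illustrative mapping only; Hadoop resolves the implementation internally. */
    static String describe(String defaultFs) {
        String scheme = schemeOf(defaultFs);
        if (scheme == null) {
            return "no scheme given";
        }
        switch (scheme) {
            case "hdfs": return "distributed file system (DistributedFileSystem)";
            case "file": return "local file system";
            default:     return "other file system: " + scheme;
        }
    }

    public static void main(String[] args) {
        // The value shipped in core-default.xml
        System.out.println("file:///          -> " + describe("file:///"));
        // The value set explicitly in the examples of this article
        System.out.println("hdfs://node1:8020 -> " + describe("hdfs://node1:8020"));
    }
}
```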

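Ways to obtain a FileSystem

First way

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.Test;

@Test
public void getFileSystem1() throws IOException {
    // Create the Configuration object
    Configuration configuration = new Configuration();
    // Set the file system type
    configuration.set("fs.defaultFS", "hdfs://node1:8020");
    // Get the file system described by the configuration
    FileSystem fileSystem = FileSystem.get(configuration);
    // Print it
    System.out.println(fileSystem);
}

// Output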
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_492184908_1, ugi=caoxiaozhu (auth:SIMPLE)]]

Second way

@Test public void getFileSystem2() throws Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    System.out.println("fileSystem:"+fileSystem);
}

Third way

@Test public void getFileSystem3() throws Exception{
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://node1:8020");
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    System.out.println(fileSystem.toString());
}

Fourth way

@Test 
public void getFileSystem4() throws Exception{
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://node1:8020") ,new Configuration());
    System.out.println(fileSystem.toString()); 
}

Listing all files in HDFS

Listing via the API

@Test
public void listMyFiles() throws Exception {
    // Get the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // listFiles returns a RemoteIterator over files only (not directories).
    // The first argument is the path to list; the second enables recursive listing.
    RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/dir1"), true);
    while (locatedFileStatusRemoteIterator.hasNext()) {
        LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
        System.out.println(next.getPath().toString());
    }

    // Close the file system
    fileSystem.close();
}

Creating a directory on HDFS

@Test
public void mkdirs() throws Exception {

    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());

    // Creates the directory together with any missing parent directories
    boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));

    System.out.println("Directory created ====> " + mkdirs);

    fileSystem.close();
}

Downloading a file from HDFS

Method 1

@Test
public void downloadFile() throws Exception {
    // 1. Get the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // 2. Open the HDFS input stream
    FSDataInputStream inputStream = fileSystem.open(new Path("/dir1/123.txt"));
    // 3. Open the output stream for the local path
    FileOutputStream outputStream = new FileOutputStream("/Users/caoxiaozhu/Desktop/result.txt");
    // 4. Copy the file
    IOUtils.copy(inputStream, outputStream);
    // 5. Close the streams and the file system
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
    fileSystem.close();
}

Method 2

@Test
public void downloadFile2() throws Exception {

    // Get the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());

    // First argument: the source path on HDFS; second argument: the local destination path
    fileSystem.copyToLocalFile(new Path("/dir1/123.txt"), new Path("/Users/caoxiaozhu/Desktop/test123.txt"));

    // Close the file system
    fileSystem.close();
}

Uploading a file to HDFS

@Test
public void uploadFile() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // First argument: the local source path; second argument: the destination directory on HDFS
    fileSystem.copyFromLocalFile(new Path("/Users/caoxiaozhu/Desktop/test123.txt"), new Path("/dir1"));
    // Close the file system
    fileSystem.close();
}

Downloading as a specific user

Pass the user name as an additional argument to FileSystem.get:

@Test
public void getConfig() throws Exception {

    // The third argument is the user to act as on HDFS
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration(), "hadoop");

    fileSystem.copyToLocalFile(new Path("/config/core-site.xml"), new Path("file:///c:/core-site.xml"));

    fileSystem.close();
}

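Merging small files on HDFS

Hadoop is good at storing large files because a large file carries relatively little metadata. If the cluster holds a large number of small files, each one needs its own metadata entry, which greatly increases the memory pressure of managing metadata in the cluster. In practice, therefore, small files should be merged into large files for processing whenever necessary.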
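A back-of-the-envelope calculation can make this memory pressure tangible. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block); that figure is an approximation, not a measured value, and the class SmallFilesEstimate and the file counts are hypothetical.

```java
public class SmallFilesEstimate {

    // Assumption: rough rule of thumb, not a measurement of any particular cluster
    static final long BYTES_PER_OBJECT = 150;

    /** Estimates heap use: one object per file plus one per block of that file. */
    static long estimateHeapBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Hypothetical workload: ten million small files of one block each
        long small = estimateHeapBytes(10_000_000L, 1);
        // The same data imagined as ten thousand merged files of eight blocks each
        long merged = estimateHeapBytes(10_000L, 8);
        System.out.println("small files : " + small + " bytes");
        System.out.println("merged files: " + merged + " bytes");
    }
}
```

Under these assumptions the small-file layout needs a few gigabytes of heap for metadata alone, while the merged layout needs only a few megabytes.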

From the HDFS shell, many HDFS files can be merged into one large file on the command line while downloading it to the local machine:

cd /export/servers 

hdfs dfs -getmerge /config/*.xml ./hello.xml

Since small files can be merged into one large file while downloading, they can just as well be merged into one large file while uploading.

Using the API

@Test
public void mergeFile() throws Exception {
    // Get the HDFS file system
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Open the output stream for the large file on HDFS
    FSDataOutputStream outputStream = fileSystem.create(new Path("/dir1/big.txt"));
    // Get the local file system
    LocalFileSystem localFileSystem = FileSystem.getLocal(new Configuration());

    // List the entries in the local directory
    FileStatus[] fileStatuses = localFileSystem.listStatus(new Path("/Users/caoxiaozhu/Desktop/big"));

    // Open each small file and append its contents to the large file
    for (FileStatus fileStatus : fileStatuses) {
        if (!fileStatus.isFile()) {
            continue; // skip subdirectories, which cannot be opened as a stream
        }
        FSDataInputStream inputStream = localFileSystem.open(fileStatus.getPath());
        IOUtils.copy(inputStream, outputStream);
        IOUtils.closeQuietly(inputStream);
    }

    // Close the output stream and both file systems
    IOUtils.closeQuietly(outputStream);
    localFileSystem.close();
    fileSystem.close();
}
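The same copy-loop pattern can be tried without a cluster. The following is a local-only analogue written against java.nio rather than the Hadoop API, so it is a sketch of the technique and not a replacement for the test above; the class LocalMerge is a hypothetical name. It sorts the parts by name so that the merge order is deterministic, and it expects the target to live outside the source directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalMerge {

    /** Concatenates every regular file in sourceDir, sorted by name, into target. */
    static void merge(Path sourceDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> entries = Files.list(sourceDir)) {
            parts = entries.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        // One output stream stays open while each part is appended in turn
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: LocalMerge <sourceDir> <targetFile>");
            return;
        }
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The try-with-resources blocks close the directory listing and the output stream even if a copy fails, which the explicit closeQuietly calls in the HDFS version do not guarantee.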
